
news 2026/4/17 20:31:01

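Binary Prediction of Poisonous Mushrooms

You have provided many resources and links covering different binary classification tasks; they appear to be reference material and high-scoring solutions from Kaggle competitions. To help you make better use of them, here is a summary of the key points.

Playground Season 4 Episode 8
- Main related competition: Binary Classification with a Bank Churn Dataset.
- Dataset: reorganized and published for reference.
- Popular solutions: LightGBM and CatBoost models (score 0.8945); XGBoost and random forest models; neural network classifiers.

Other related competitions and resources
- Binary Prediction of Smoker Status using Bio-Signals: EDA and feature engineering; XGBoost models.
- Binary Classification with a Software Defects Dataset: EDA and modeling.
- Binary Classification of Machine Failures: EDA, ensembling, ML pipeline, SHAP analysis.
- Binary Classification with a Tabular Kidney Stone Prediction Dataset: comparison of several models.

Featured competitions
- American Express - Default Prediction: feature engineering and LightGBM models.
- Home Credit Default Risk: complete EDA and feature importance analysis.

Competition metric - Matthews correlation coefficient
- Definition: a measure of the quality of a binary classifier's output.
- Resources: the Wikipedia page on the Phi coefficient; the Voxco blog article on the Matthews correlation coefficient; a paper on the Matthews correlation coefficient in biological data mining; the scikit-learn documentation for the Matthews correlation coefficient.

Step 1: Loading the data

```python
import lightgbm as lgb
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from catboost import CatBoostClassifier, Pool
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Load the training data
train_data = pd.read_csv("train.csv")

# Show the first rows to understand the data structure
print(train_data.head())

# Show basic information about the data (info() prints directly)
train_data.info()
```

Step 2: Data exploration and visualization

In this step we explore the data in more depth and use plots to understand the distributions and the relationship between the features and the target.

```python
# Count the mushrooms in each class
print(train_data["class"].value_counts())

# Plot the class distribution
plt.figure(figsize=(8, 6))
sns.countplot(x="class", data=train_data)
plt.title("Distribution of Mushroom Classes")
plt.show()

# Plot each categorical feature against the target
cat_features = train_data.select_dtypes(include=["object"]).columns.drop("class")
fig, axs = plt.subplots(5, 5, figsize=(20, 20))
axs = axs.flatten()
for i, col in enumerate(cat_features):
    sns.countplot(x=col, hue="class", data=train_data, ax=axs[i])
    axs[i].set_title(f"Distribution of {col} by Class")
plt.tight_layout()
plt.show()
```

Step 3: Data preprocessing

Next we preprocess the data, including feature encoding and any other necessary transformations.

```python
# Encode the categorical columns, keeping one fitted encoder per column
encoders = {}
for col in train_data.select_dtypes(include=["object"]).columns:
    encoder = LabelEncoder()
    # astype(str) turns missing values into the category "nan"
    train_data[col] = encoder.fit_transform(train_data[col].astype(str))
    encoders[col] = encoder

# Inspect the encoded data ("class" becomes e -> 0, p -> 1)
print(train_data.head())
```

Step 4: Building the models

In this step we build and train a LightGBM model and a CatBoost model.

```python
# Separate features and target; "id" is dropped because it is not a feature
X = train_data.drop(["id", "class"], axis=1)
y = train_data["class"]

# Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# LightGBM parameters
lgb_params = {
    "objective": "binary",
    "metric": "auc",
    "verbosity": -1,
    "boosting_type": "gbdt",
    "num_leaves": 31,
    "learning_rate": 0.05,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.8,
    "bagging_freq": 5,
    "lambda_l1": 0.1,
    "lambda_l2": 0.1,
}

# LightGBM datasets
lgb_train = lgb.Dataset(X_train, y_train)
lgb_val = lgb.Dataset(X_val, y_val, reference=lgb_train)

# Train LightGBM; early stopping is passed as a callback in LightGBM 4.x
lgb_model = lgb.train(
    lgb_params,
    lgb_train,
    num_boost_round=1000,
    valid_sets=[lgb_val],
    callbacks=[lgb.early_stopping(stopping_rounds=100)],
)

# CatBoost parameters
cb_params = {
    "loss_function": "Logloss",
    "eval_metric": "AUC",
    "learning_rate": 0.05,
    "depth": 6,
    "l2_leaf_reg": 10,
    "bootstrap_type": "Bayesian",
    "bagging_temperature": 0.2,
    "random_seed": 42,
    "allow_writing_files": False,
}

# CatBoost datasets
cb_train = Pool(X_train, y_train)
cb_val = Pool(X_val, y_val)

# Train CatBoost
cb_model = CatBoostClassifier(**cb_params)
cb_model.fit(cb_train, eval_set=cb_val, use_best_model=True, verbose=False)
```

Step 5: Predicting on the test data

Finally, we use the trained models to predict on the test data and arrange the predictions in the format Kaggle expects.

```python
# Load the test data
test_data = pd.read_csv("test.csv")

# Apply each column's own encoder; categories unseen in training map to -1
for col, encoder in encoders.items():
    if col in test_data.columns:
        known = {label: code for code, label in enumerate(encoder.classes_)}
        test_data[col] = test_data[col].astype(str).map(known).fillna(-1).astype(int)

# Keep the test ids and drop the column from the features
test_ids = test_data["id"]
test_features = test_data.drop("id", axis=1)

# Predict probabilities of the positive class (poisonous)
predictions_lgb = lgb_model.predict(test_features, num_iteration=lgb_model.best_iteration)
predictions_cb = cb_model.predict_proba(test_features)[:, 1]

# Threshold the probabilities at 0.5
predictions_lgb_binary = (predictions_lgb > 0.5).astype(int)
predictions_cb_binary = (predictions_cb > 0.5).astype(int)

# Build the submission from the LightGBM predictions
submission_df = pd.DataFrame({"id": test_ids, "class": predictions_lgb_binary})
submission_df["class"] = submission_df["class"].map({1: "p", 0: "e"})

# Save the predictions
submission_df.to_csv("submission.csv", index=False)
```

Now for a complete code example for the "Binary Prediction of Poisonous Mushrooms" problem. We use LightGBM and CatBoost for prediction and follow the steps discussed above. First, install the required libraries if you have not already:

```
pip install pandas numpy scikit-learn lightgbm catboost matplotlib seaborn plotly
```

The complete example:

```python
import warnings

import lightgbm as lgb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
from catboost import CatBoostClassifier
from plotly.subplots import make_subplots
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Silence library warnings
warnings.filterwarnings("ignore")

# Load the data
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")


def preprocess_data(train, test, target="class"):
    """Label-encode categorical features consistently across train and test."""
    train, test = train.copy(), test.copy()
    for col in test.select_dtypes(include=["object"]).columns:
        encoder = LabelEncoder()
        # Fit on both frames so a category gets the same code everywhere
        encoder.fit(pd.concat([train[col], test[col]]).astype(str))
        train[col] = encoder.transform(train[col].astype(str))
        test[col] = encoder.transform(test[col].astype(str))
    # e (edible) -> 0, p (poisonous) -> 1
    train[target] = train[target].map({"e": 0, "p": 1})
    return train, test


train_data, test_data = preprocess_data(train_data, test_data)

# Split the data; "id" is dropped because it is not a feature
X = train_data.drop(["id", "class"], axis=1)
y = train_data["class"]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# LightGBM parameters
lgb_params = {
    "objective": "binary",
    "metric": "binary_logloss",
    "verbosity": -1,
    "boosting_type": "gbdt",
    "num_leaves": 31,
    "learning_rate": 0.05,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.8,
    "bagging_freq": 5,
    "lambda_l1": 0.1,
    "lambda_l2": 0.1,
}

# LightGBM datasets
lgb_train = lgb.Dataset(X_train, y_train)
lgb_val = lgb.Dataset(X_val, y_val, reference=lgb_train)

# Train LightGBM; early stopping is passed as a callback in LightGBM 4.x
lgb_model = lgb.train(
    lgb_params,
    lgb_train,
    num_boost_round=1000,
    valid_sets=[lgb_val],
    callbacks=[lgb.early_stopping(stopping_rounds=100)],
)

# CatBoost parameters
cb_params = {
    "loss_function": "Logloss",
    "eval_metric": "AUC",
    "learning_rate": 0.05,
    "depth": 6,
    "l2_leaf_reg": 10,
    "bootstrap_type": "Bayesian",
    "bagging_temperature": 0.2,
    "random_seed": 42,
    "allow_writing_files": False,
}

# Train CatBoost
cb_model = CatBoostClassifier(**cb_params)
cb_model.fit(X_train, y_train, eval_set=(X_val, y_val), use_best_model=True, verbose=False)

# Test-set prediction
test_ids = test_data["id"]
test_features = test_data.drop("id", axis=1)

# LightGBM predictions
predictions_lgb = lgb_model.predict(test_features, num_iteration=lgb_model.best_iteration)
predictions_lgb_binary = (predictions_lgb > 0.5).astype(int)

# CatBoost predictions (probability of the positive class)
predictions_cb = cb_model.predict_proba(test_features)[:, 1]
predictions_cb_binary = (predictions_cb > 0.5).astype(int)

# Evaluate both models on the validation set
val_pred_lgb = (lgb_model.predict(X_val, num_iteration=lgb_model.best_iteration) > 0.5).astype(int)
val_pred_cb = (cb_model.predict_proba(X_val)[:, 1] > 0.5).astype(int)
mcc_lgb = matthews_corrcoef(y_val, val_pred_lgb)
mcc_cb = matthews_corrcoef(y_val, val_pred_cb)

print("LightGBM Matthews Correlation Coefficient: ", mcc_lgb)
print("CatBoost Matthews Correlation Coefficient: ", mcc_cb)

# Build the submission from the LightGBM predictions
submission_df = pd.DataFrame({"id": test_ids, "class": predictions_lgb_binary})
submission_df["class"] = submission_df["class"].map({1: "p", 0: "e"})

# Save the predictions
submission_df.to_csv("submission.csv", index=False)


def plot_feature_importance(model, title):
    """Plot the top 20 LightGBM features by gain."""
    fig, ax = plt.subplots(figsize=(12, 8))
    lgb.plot_importance(model, max_num_features=20, importance_type="gain", ax=ax)
    ax.set_title(title)
    plt.show()


# LightGBM feature importance
plot_feature_importance(lgb_model, "LightGBM Feature Importance")

# CatBoost feature importance, plotted manually from get_feature_importance()
cb_importance = pd.Series(cb_model.get_feature_importance(), index=X_train.columns)
cb_importance.nlargest(20).sort_values().plot(
    kind="barh", figsize=(12, 8), title="CatBoost Feature Importance"
)
plt.show()
```

This code performs the following tasks:
- Imports the required libraries.
- Loads the training and test data.
- Preprocesses the data, including encoding the categorical features.
- Splits the data into training and validation sets.
- Defines and trains the LightGBM and CatBoost models.
- Predicts on the test data.
- Evaluates the models with the Matthews correlation coefficient.
- Arranges the predictions and saves them as a CSV file for submission.
- Visualizes feature importance.

Reference

Binary Classification with a Bank Churn Dataset
Playground Series - Season 4, Episode 1

Samvel Kocharyan · 17th in this Competition · Posted 7 months ago

17th Place Solution | AutoML + Unicorn's pollen + Lack of sleep

Context

S4E1 Playground "Binary Classification with a Bank Churn Dataset".
- Business context: https://www.kaggle.com/competitions/playground-series-s4e1/overview
- Data context: https://www.kaggle.com/competitions/playground-series-s4e1/data

Overview of the approach

Our final submission was a combination of an AutoGluon 3-level stack we called "Frankenstein II" and a set of averages from our previous models and some public notebooks. The final submission was trained on the reduced set of features we got from OpenFE. Features were eliminated by BorutaSHAP and RFECV. The final model used 103 features.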
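The post names RFECV as one of its elimination tools but gives no code for it. As a rough illustration only, here is a minimal sketch of recursive feature elimination with cross-validation, assuming scikit-learn and synthetic data in place of the OpenFE-generated features. The helper name `select_features`, the logistic-regression estimator, the 3-fold split, and the ROC AUC scoring are all illustrative assumptions, not the team's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold


def select_features(X, y, min_features=1):
    """Return a boolean mask of the features RFECV decides to keep."""
    selector = RFECV(
        estimator=LogisticRegression(max_iter=1000),
        step=1,  # drop one feature per elimination round
        cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
        scoring="roc_auc",
        min_features_to_select=min_features,
    )
    selector.fit(X, y)
    return selector.support_


# Synthetic stand-in data: 12 features, only 4 of them informative
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=4, n_redundant=0, random_state=0
)
mask = select_features(X, y)
print(f"kept {int(mask.sum())} of {X.shape[1]} features")
```

The returned mask can be used to subset the columns of both the training and the test frame, which is the role feature elimination plays in the approach described above.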
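Detail of the Submissions

We selected 2 submissions:
- WeightedEnsemble_L3: 0.89372 Public | 0.89637 Private | 0.898947 CV
- Winning solution: 0.90106 Private | 0.89687 Public. We got it from averaging 0.89673 and 0.89565 in the last hours of the competition.

[Figure: Frankenstein II schema]

What worked for us?
- Feature generation - 470 and Feature Elimination - 103
- Data-Centric Approach (CleanLab)
- Relabeling
- AutoGluon 1.0.1 (thanks to innixma)
- BorutaSHAP framework and sklearn RFECV
- Ideas published by paddykb, thomasmeiner and the respected community
- Merging, Stacking, Ensembling, Averaging
- Tons of experiments. Mainly for educative purposes
- Kaggle Alchemists Secret Society named after Akka från Kebnekajse
- Unicorn's pollen

What didn't work for us this time?
- PCA / ICA
- Standalone Boosting models
- TabPFN
- Surnames features
- Original dataset

Sources
- https://www.kaggle.com/competitions/playground-series-s4e1/discussion/470363
- https://www.kaggle.com/competitions/playground-series-s4e1/discussion/471164
- https://www.kaggle.com/competitions/playground-series-s4e1/discussion/469859
- https://www.kaggle.com/competitions/playground-series-s4e1/discussion/465192
- https://www.kaggle.com/competitions/playground-series-s4e1/discussion/470610
- https://www.kaggle.com/code/arunklenin/ps4e1-advanced-feature-engineering-ensemble
- https://www.kaggle.com/code/thomasmeiner/ps4e1-eda-feature-engineering-modelling

Based on the information you provided, this is an overview of a solution to the Kaggle competition "Playground Series - Season 4, Episode 1". In this competition, participants build a binary classification model that predicts bank customer churn. The main parts of the solution are as follows.

Business context
- Goal: predict whether a customer will leave the bank.
- Data: customers' basic information, transaction records, and so on.

Data context
- The dataset contains multiple features used to predict whether a customer will choose to leave the bank.
- Features include, but are not limited to, age, gender, geography, account balance, and products held.

Solution overview
- Final submission: an AutoGluon 3-level stacked model combined with averages of earlier models.
- Feature engineering: OpenFE was used for feature generation and elimination; 103 features were used in the end.
- Model training: the AutoGluon framework, combined with BorutaSHAP and RFECV for feature selection.

Key techniques
- Feature generation and elimination: new features were generated in several ways, then selected with BorutaSHAP and RFECV.
- Data cleaning: CleanLab was used to clean the data.
- Label correction: samples were relabeled to improve accuracy.
- AutoGluon: version 1.0.1 was used for automated machine learning.
- Ensembling: stacking, merging, and averaging improved the model's generalization.

Methods that did not work
- PCA/ICA: principal component analysis and independent component analysis did not improve model performance.
- Standalone boosting models: boosting models on their own performed poorly.
- TabPFN: a neural network architecture for tabular data that showed no notable gain in this competition.
- Surname features: using customers' surnames as features did not improve model performance.
- Original dataset: using only the original dataset performed worse than the feature-engineered dataset.

Implementation code

Given the complexity of the solution above and the techniques involved, here is a simplified example that shows how to use AutoGluon for automated machine learning together with feature selection:

```python
import pandas as pd
from autogluon.tabular import TabularPredictor
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Data paths
train_path = "train.csv"
test_path = "test.csv"

# Load the data
train_data = pd.read_csv(train_path)
test_data = pd.read_csv(test_path)

# Data preprocessing
# ...

# "target" is a placeholder label name; "id" is excluded from the features
feature_cols = [c for c in train_data.columns if c not in ("id", "target")]
X = train_data[feature_cols]
y = train_data["target"]

# Feature selection with Boruta (BorutaPy stands in for BorutaSHAP here)
rf = RandomForestClassifier(n_jobs=-1, class_weight="balanced", max_depth=5)
feat_selector = BorutaPy(rf, n_estimators="auto", verbose=2, random_state=1)
feat_selector.fit(X.values, y.values)  # BorutaPy expects NumPy arrays

# Feature selection with RFECV
rfecv = RFECV(
    estimator=RandomForestClassifier(),
    step=1,
    cv=StratifiedKFold(5),
    scoring="accuracy",
    verbose=2,
)
rfecv.fit(X, y)

# Keep only the features that both selectors retained
selected_mask = feat_selector.support_ & rfecv.support_
selected_features = X.columns[selected_mask].tolist()
train_data_selected = train_data[selected_features + ["target"]]
test_data_selected = test_data[selected_features]

# Automated machine learning with AutoGluon
predictor = TabularPredictor(label="target", problem_type="binary").fit(
    train_data=train_data_selected, presets="best_quality", time_limit=1200
)

# Predict
predictions = predictor.predict(test_data_selected)

# Save the predictions
submission = pd.DataFrame({"id": test_data["id"], "target": predictions})
submission.to_csv("submission.csv", index=False)
```

Notes
- Make sure AutoGluon, BorutaPy, and the other required libraries are installed.
- The example assumes the dataset has already been preprocessed appropriately, for example missing values handled and categorical features converted.
- Depending on the actual dataset, the parameters and methods may need further adjustment.

Reference and starter material - Playground Season 4 Episode 8

Hello all, and best wishes for the current episode, Playground Series Season 4 Episode 8. I hope the references and links compiled and shortlisted below help you get on board effectively and quickly.

Original dataset

The competition and original datasets have been reorganized and posted here for reference.

Binary classifier playground competitions

Binary Classification with a Bank Churn Dataset

Most-voted kernels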
- https://www.kaggle.com/code/abdmental01/bank-churn-lightgbm-and-catboost-0-8945
- https://www.kaggle.com/code/akhiljethwa/playground-s4e1-eda-modeling-xgboost
- https://www.kaggle.com/code/hardikgarg03/bank-churn-random-forest-xgboost-and-lightbgm
- https://www.kaggle.com/code/marianadeem755/bank-churn-classification-neural-network-xgboost
- https://www.kaggle.com/code/mouadberqia/bank-churn-prediction-beginner-friendly-0-88959
- PS4E1 | Advanced Feature Engineering | Ensemble | Kaggle
- https://www.kaggle.com/code/danishammar/bank-churn-165034-dl
- https://www.kaggle.com/code/aspillai/bank-churn-catboost-0-89626

High-scoring approaches and discussions
- https://www.kaggle.com/competitions/playground-series-s4e1/discussion/472496 -- Rank 2
- https://www.kaggle.com/competitions/playground-series-s4e1/discussion/472413 -- Rank 3
- https://www.kaggle.com/competitions/playground-series-s4e1/discussion/472636 -- Rank 17
- https://www.kaggle.com/competitions/playground-series-s4e1/discussion/473257 -- Late submission
- https://www.kaggle.com/competitions/playground-series-s4e1/discussion/472502 -- Rank 1
- https://www.kaggle.com/competitions/playground-series-s4e1/discussion/472497 -- Rank 5
- https://www.kaggle.com/competitions/playground-series-s4e1/discussion/472466 -- Rank 11

Binary Prediction of Smoker Status using Bio-Signals

Most-voted kernels
- https://www.kaggle.com/code/cv13j0/efficient-prediction-of-smoker-status
- https://www.kaggle.com/code/arunklenin/ps3e24-eda-feature-engineering-ensemble
- https://www.kaggle.com/code/ravi20076/playgrounds3e24-eda-baseline
- https://www.kaggle.com/code/oscarm524/ps-s3-ep24-eda-modeling-submission
- https://www.kaggle.com/code/ashishkumarak/binary-classification-smoker-or-not-eda-xgboost

High-scoring approaches and discussions
- https://www.kaggle.com/competitions/playground-series-s3e24/discussion/455248 -- Rank 3
- https://www.kaggle.com/competitions/playground-series-s3e24/discussion/455296 -- Rank 4
- https://www.kaggle.com/competitions/playground-series-s3e24/discussion/455271 -- Rank 7
- https://www.kaggle.com/competitions/playground-series-s3e24/discussion/455268 -- Rank 8

Binary Classification with a Software Defects Dataset

Most-voted kernels
- https://www.kaggle.com/code/ambrosm/pss3e23-eda-which-makes-sense
- https://www.kaggle.com/code/oscarm524/ps-s3-ep23-eda-modeling-submission
- https://www.kaggle.com/code/iqbalsyahakbar/ps3e23-binary-classification-for-beginners
- https://www.kaggle.com/code/ravi20076/playgrounds3e23-eda-baseline
- https://www.kaggle.com/code/zhukovoleksiy/ps-s3e23-explore-data-stacking-ensemble

High-scoring approaches and discussions
- https://www.kaggle.com/competitions/playground-series-s3e23/discussion/450315 -- Rank 2

Binary Classification of Machine Failures

Most-voted kernels
- PS3E17 EDA| Ensemble ML Pipeline | SHAP | Kaggle
- https://www.kaggle.com/code/yantxx/xgboost-binary-classifier-machine-failure
- https://www.kaggle.com/code/manishkumar7432698/pse17-feature-engineering-tuning-optuna
- https://www.kaggle.com/code/tumpanjawat/s3e17-mf-eda-clustering-adaboost
- https://www.kaggle.com/code/akioonodera/ps-3-17-lgbm-bin

High-scoring approaches and discussions
- https://www.kaggle.com/competitions/playground-series-s3e17/discussion/419730 -- Rank 3
- https://www.kaggle.com/competitions/playground-series-s3e17/discussion/419643 -- Rank 11

Binary Classification with a Tabular Kidney Stone Prediction Dataset

Most-voted kernels
- https://www.kaggle.com/code/richeyjay/kidney-stone-prediction-eda-binary-classification
- https://www.kaggle.com/code/kimtaehun/nice-eda-and-quick-xgb-baseline-in-2minutes
- https://www.kaggle.com/code/tumpanjawat/kidney-stone-eda-prediction-7-model-2-nn
- https://www.kaggle.com/code/hardikgarg03/kidney-stone-prediction
- PS3E12 | Simple EDA, FE, and Model for Beginners | Kaggle

High-scoring approaches and discussions
- https://www.kaggle.com/competitions/playground-series-s3e12/discussion/402403 -- Rank 5
- https://www.kaggle.com/competitions/playground-series-s3e12/discussion/402416 -- Rank 8
- https://www.kaggle.com/competitions/playground-series-s3e12/discussion/402398 -- Rank 24

Binary classifier featured competitions

American Express - Default Prediction

Most-voted kernels
- https://www.kaggle.com/code/ambrosm/amex-eda-which-makes-sense
- https://www.kaggle.com/code/ragnar123/amex-lgbm-dart-cv-0-7977
- AMEX Default Prediction EDA LGBM Baseline | Kaggle
- https://www.kaggle.com/code/ambrosm/amex-lightgbm-quickstart
- https://www.kaggle.com/code/jiweiliu/rapids-cudf-feature-engineering-xgb
High-scoring kernels
- https://www.kaggle.com/code/hideyukizushi/amex-inf-blend-onlyteam-v2
- Amex LGBM Dart CV 0.7977 | Kaggle
- https://www.kaggle.com/code/thedevastator/the-fine-art-of-hyperparameter-tuning
- https://www.kaggle.com/code/rm1000/ensembling-with-vectorization

High-scoring approaches and discussions
- https://www.kaggle.com/competitions/amex-default-prediction/discussion/348111 -- Rank 1
- https://www.kaggle.com/competitions/amex-default-prediction/discussion/347637 -- Rank 2
- https://www.kaggle.com/competitions/amex-default-prediction/discussion/349741 -- Rank 3
- American Express - Default Prediction | Kaggle -- Rank 5
- American Express - Default Prediction | Kaggle -- Rank 9

Home Credit Default Risk

Most-voted kernels
- https://www.kaggle.com/code/willkoehrsen/start-here-a-gentle-introduction
- https://www.kaggle.com/code/codename007/home-credit-complete-eda-feature-importance
- https://www.kaggle.com/code/willkoehrsen/introduction-to-manual-feature-engineering

High-scoring approaches and discussions
- https://www.kaggle.com/competitions/home-credit-default-risk/discussion/64821 -- Rank 1
- https://www.kaggle.com/competitions/home-credit-default-risk/discussion/64722 -- Rank 2
- https://www.kaggle.com/competitions/home-credit-default-risk/discussion/64596 -- Rank 3

Home Credit - Credit Risk Model Stability

Most-voted kernels
- https://www.kaggle.com/code/greysky/home-credit-baseline
- https://www.kaggle.com/code/sergiosaharovskiy/home-credit-crms-2024-eda-and-submission
- https://www.kaggle.com/code/pereradulina/credit-risk-prediction-with-lightgbm-and-catboost

High-scoring approaches and discussions
- https://www.kaggle.com/competitions/home-credit-credit-risk-model-stability/discussion/508337 -- Rank 1
- https://www.kaggle.com/competitions/home-credit-credit-risk-model-stability/discussion/508113 -- Rank 10

Competition metric - Matthews correlation
- https://en.wikipedia.org/wiki/Phi_coefficient
- Matthews's correlation coefficient: Definition, Formula and advantages - Voxco
- The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation | BioData Mining | Full Text
- matthews_corrcoef - scikit-learn 1.5.2 documentation -- the scikit-learn metrics documentation

First-place solution

1st Place Solution: 72 OOFs, a whole lotta Autogluon, and 31 scores of 0.98512 or above (on the private LB)

Apologies for such a long post - to paraphrase the famous words of Blaise Pascal, I didn't have the time to make it shorter.

Well, that was a very satisfying competition indeed! The core of my success was the same as described (along with some back story) in my post from last month (4th place solution, PSS4E8) - large ensembles, and a whole lotta hustling in the absence of serious resources outside Kaggle. While this month wasn't as frustrating as last month, with only 3 million samples instead of last month's 11, we did have about twice as many variables. But everything was more manageable in terms of space and storage, though I did face some familiar frustrations in terms of GPU quota and 12 hour run limits on Kaggle.
The TLDR is similar:
- Gathered a veritable zoo of models, ensembled them, kept an eye on CV and model diversity, kept ensembling while the score increased
- Kept running out of GPU and execution time (at 12 hours) on Kaggle
- Experimented a lot more with Autogluon
- Kept experimenting with everything: new models, hyperparameters, ensembling approaches, etc.
- Ended up with nearly 80 OOF arrays, used 72 in the end
- Ended up with 31 scores of 0.98512 or above (0.98511 being the second highest on the private LB), the first of which was achieved on August 17th, with two weeks remaining in the competition.

Before I go on, let me acknowledge the generosity of those who shared their insights, findings and code, including but not limited to ambrosm, siukeitin, nischaydnk, gauravduttakiit, rzatemizel, ravaghi, oscarm524, ravi20076, tilii7, roberthatch, omidbaghchehsaraei, trupologhelper, arunklenin, carlmcbrideellis.

It was great to have ambrosm back this month, as another of his wonderful EDA notebooks helped many of us get up and running - it (and some blending) helped me get to a score of 0.98516 on day 1 (private: 0.98498). A few things were clear right on day 1, including that Autogluon was doing very well on this dataset, and I noticed in gauravduttakiit's notebook with LazyPredict results that Random Forests and Extra Trees were quite competitive this time around, and made a mental note about including them in my ensembles. As soon as I saw siukeitin's brilliant post about an exact solution to the original dataset, I added the probability of being poisonous as a new feature, reasoning that it could be a proxy for the "signal" from the original still present in the current dataset. It helped boost the score of some models, just as including the original did - and even when they didn't boost the score, they added to the diversity of the ensemble. carlmcbrideellis was the catalyst for that work by siukeitin, as he provided a dataset with a million mushrooms, based on the original dataset.
He also initiated a competition for perfectly predicting the labels on his dataset in the least time - playing around with that helped me figure out that one can speed up LGBM by setting num_threads = number of CPU cores.

I spent a fair amount of time on Autogluon (AG) this month, as there were so many notebooks using it to achieve great scores on the public LB. gauravduttakiit also showed the importance of using GPUs and long runs with AG, when just using a GPU made the score jump from 0.98482 to 0.98524, without changing the rest of the code. Meanwhile I had to ensemble a few dozen models to get anywhere near that. I immediately launched a long GPU run of AG, which led to the single most frustrating moment of the competition, as Kaggle killed the notebook after 12 hours, right in the middle of producing the output files.

From the beginning, I explored ensembling using various methods, including Hill Climbing, which was more feasible this month than last, when it was a nonstarter beyond 10-15 models. This month I used it till the very end, though it took over 2 hours once I went beyond 60 models or so. One of my breakthrough scores (0.98530 without any blending) came via a combination of Ridge and GBDTs for ensembling. However, Ridge generally gave the best combination of speed and LB score, so most of my submissions used that. It helped me get to the first score where I started to feel confident, as an ensemble of about 50 models achieved 0.98525 without any "blind blending". Unbeknownst to me, this also achieved a score of 0.98512 on the private LB (with 2 weeks to go), which might have sufficed to win. So in some sense, I was running up the (teeny) margin after this, though of course I had no way of knowing this.

Highs and lows of experiments with Autogluon (AG)

With about 10 days remaining, I decided to invest more time in experimenting with AG, which seemed likely to help push my scores further.
After perusing several notebooks, I noticed that XGBoost and CatBoost were the weakest models within AG, which was interesting, especially since XGBoost was the best performing outside of AG/AutoML. I reasoned that excluding them might improve AG's score by giving more time to the better performing models - it didn't improve, but it didn't worsen either, and one could achieve the same score in about half the time. I then noticed that the top ensemble was almost always of GBM and XT alone, so I dropped everything else, and the score was only about 0.0001 less. Finally, I decided to run individual models alone via AG, and ensemble them myself, to see whether this would allow each model more time, and thereby lead to an improved ensemble score, and it did, though again only by about 0.0001. Finally, I decided to throw my OOFs into AG - but that's a story for later (about two paragraphs later).

Three ways to 0.98535, en route to 0.98537

After the debacle of June, when I held the no. 1 spot for the second half of the month while overfitting to oblivion, I remained focused on building a robust ensemble. But I wasn't above some "blind blending" when I had submissions remaining for the day with too little time to get new results to add to the ensemble, and indeed that was part of how I first got to number 1 this month. I did try to use two other solutions (Gaurav's and nischaydnk's) built differently from mine but with about the same score of 0.98525, and ended up with 0.98532. A few days later, two such blends of mine with arunklenin's 0.98527 got me to 0.98534 and into the lead. Finally I used the "insert confident disagreements" approach to overwrite my predictions with those of another model/ensemble, if the latter produced a sufficiently high probability (say 0.99), which produced my first LB score of 0.98535, albeit on rather shaky grounds (the private LB score for this turned out to be a mere 0.98506).
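The post describes the "insert confident disagreements" step only in words. One way it could look, as a sketch under stated assumptions: the function name `insert_confident_disagreements` is hypothetical, the 0.5 decision threshold is assumed, and treating very low probabilities (at or below 1 - 0.99) as equally confident is my assumption, since the post mentions only a sufficiently high probability.

```python
import numpy as np


def insert_confident_disagreements(base_probs, other_probs, confidence=0.99):
    """Overwrite base predictions where the other model is confident and disagrees.

    Returns the merged probabilities and a boolean mask of overwritten rows.
    """
    base_probs = np.asarray(base_probs, dtype=float)
    other_probs = np.asarray(other_probs, dtype=float)

    # Hard labels at an assumed 0.5 threshold
    base_labels = base_probs > 0.5
    other_labels = other_probs > 0.5

    # "Confident" means very close to 1 or (by assumption) very close to 0
    confident = (other_probs >= confidence) | (other_probs <= 1.0 - confidence)
    overwrite = confident & (base_labels != other_labels)

    return np.where(overwrite, other_probs, base_probs), overwrite


# Toy probabilities: rows 0 and 2 are confident disagreements
base = np.array([0.40, 0.90, 0.55, 0.20])
other = np.array([0.995, 0.80, 0.003, 0.30])
merged, changed = insert_confident_disagreements(base, other)
print(merged)   # rows 0 and 2 now carry the other model's probabilities
print(changed)
```

Because only rows where the two models disagree are touched, this changes far fewer predictions than a weighted blend would, which fits the author's remark that the resulting score stood on rather shaky grounds.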
So far, my CV scores had generally been 0.98510 (with ensembles), and 0.9850 with solo models (range: 0.97844 - 0.98494). I started using AG OOFs along with the solo models, and this finally helped me get to a CV of 0.985087, and LB of 0.98533 (private: 0.98513) with 66 OOFs. At this point, I was starting to feel good, since this was easily the best score I'd obtained without any blending with others' submissions. Meanwhile, AG with CPU was giving me about 0.98524.

Finally, I decided to throw some OOFs into AG - I was wary of too many OOFs exhausting the run time on Kaggle, so I used Hill Climbing to decide which OOFs to use, and added the 8 which were chosen by hill climbing on top of the highest scoring AG OOF. Throwing this mix into AG, I launched the run and frantically kept monitoring the intermediate results, until the run concluded with an AG leaderboard score of 0.985124. Excited, I submitted with anticipation, and bingo! The LB score was 0.98532 (private: 0.98516). With ensembles of 0.98533 and 0.98532, I was feeling better and better, though I was quite aware that any number of brilliant Kagglers could overtake me at any time (some probably hidden by the army of blenders lurking nearby as well).

At long last, I decided to throw caution to the winds, and threw all 72 OOFs into AG, and to my delight, even a CPU run produced 0.98535 on the LB (private: 0.98512), a 0.98535 that I was much more confident about than the first one. In the meantime, I'd seen many people stuck on 0.98533 and 0.98534 for days, so it did seem that 0.98535 was potentially close to a winning score.

I had no GPU quota left, so I searched outside Kaggle, in vain. Saturn Cloud offered 15 hours per month, but you couldn't run anything for that long at a go without assistance from their team. Lightning.ai offered 22 hours per month, but no more than 4 hours of GPU at a time.
Nevertheless, I tried to repeat the AG run with 72 OOFs there, and quickly realized that they lacked several packages I took for granted on Kaggle, as they were set up for Deep Learning. So no LGBM (initially a shock!), and so on. I noticed that they had an option to use 32 CPUs, so I decided to go for a 3 hour run with that, reasoning that it might just be better than 12 hours with 4 CPUs on Kaggle. I was afraid that the results might be underwhelming, but to my great relief, it produced another 0.98535 (private: 0.98513).

By this time, there were lots of people right behind me - I was more concerned about known strong performers like tilii7 at 0.98533 and oscarm524 at 0.98532, since I knew they were doing solid work and not just blending away into the ether. I indulged in a not-so-blind blend of my 0.98535s, which led to a public LB score of 0.98537 (private: 0.98513) - I knew it wasn't necessarily going to score any higher on the private LB, but at least it might give pause to some of the pursuers.

I also did some trial and error ensembling of my two solid 0.98535s, but didn't choose any of them among the final two, as they scored 0.98535 or lower. Interestingly, my best private LB scores came from here - a 50-50 blend produced a private score of 0.98517, my highest; several others produced 0.98514. A 90-10 blend of my highest 0.98535 with the 0.98533 from a Ridge ensemble of 66 models produced a public score of 0.98533, but a private score of 0.98516 (second highest).

Moral of the story - trust your CV score, and keep building while keeping CV and LB in good agreement. Avoid blind blending as much as possible, tempting though it may be.

I had some great plans for the last few days, but a combination of a family emergency and running out of steam meant most of it remained unrealized.
I couldn't follow tilii7's advice of learning xLearn, didn't run models like TabNet which I'd run last time, and didn't do a sufficiently deep dive into optimizing any one model, like pushing XGBoost beyond 0.9850 (CV), or rescuing CatBoost from the CV 0.9848 range, etc. All along, my public LB scores were about 0.0002 more than my CV scores with Ridge, and 0.0001 more than my CV scores with Hill Climbing. So I was expecting the private LB scores to be about the same as my CV scores, and that proved to be the case.

Early in the month, many had expressed confidence that there wouldn't be a major shakeup, as we had millions of samples, whereas some, like oscarm524, expected a shakeup, since people would deal with the noisy data in various ways that may not generalize to the private data. In the end, the blenders proved that one can indeed overfit even datasets with millions of samples, as there was quite a shakeup. On the other hand, people like neupane9sujal, bwandowando, co000l, ravaghi, roberthatch and others made impressive jumps of 50-200 positions on the LB! Congratulations to them and everyone who finished in the Top 10 or 25.

Personally, after dropping from 1 to 113 two months ago, this month feels like having grown a bit as a Kaggler. All this month, I kept meaning to turn away from Kaggle and spend more time on the course on LLMs that I'm (supposed to be) also doing, but I was pretty much obsessed. Last month, I came 4th but got a t-shirt thanks to tilii7 having already finished in the Top 3 before. I'd said then that one day, I'll earn the t-shirt that someone else gets - it's immensely gratifying to have that come true already. Now that I've managed to get a t-shirt and the no. 1 spot, I shall step back and participate more judiciously, as I really need to put in time on the LLM course (any pointers to interesting datasets for an LLM project? Thanks in advance!).
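I'll keep submitting from time to time, but shall keep intense participation for the last week of the month, if then. All the best to everyone! It's been an amazing six months chasing the leaderboard in the Playground Series! Many thanks to everyone who helped along the way. I want to start spending more time on the rest of Kaggle (and elsewhere) now, but shall continue to participate in the Playground Series, which has been one of the best things about a difficult year for me - many thanks to Kaggle, and to all of you who make this such a fun and engaging experience. Happy Kaggling!

sample_submission.CSV

Solution approach

MushroomClassificationAnalysis (kaggle.com)

Importing Libraries

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error
    from xgboost import XGBRegressor
    import lightgbm as lgb
    import gc

Matthews Correlation Coefficient (MCC)

The image summarized here introduces the Matthews Correlation Coefficient (MCC), a metric that measures the quality of a binary classification result.

Title: Matthews Correlation Coefficient (MCC) - Quantifies the quality of binary classification

Formula:

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

Advantages listed in the image: a balanced measure; an informative measure; does not depend on the threshold; useful for imbalanced datasets; useful for multi-class classification.

Strength scale (from -1 to 1): negative values indicate inverse prediction, 0 indicates random prediction, values up to 0.3 a weak prediction, 0.3 to 0.7 a moderate prediction, and 0.7 to 1 a strong prediction.

Practical example: suppose a fraud-detection problem in which the goal is to separate non-fraudulent (legitimate) activity from fraud. The actual classes are legitimate (0) and fraud (1), and so are the predicted classes. Counting the outcomes gives, for the legitimate class, TP = 6 and FN = 2, and for the fraud class, FP = 1 and TN = 3. Plugging these counts into the formula:

$$\mathrm{MCC} = \frac{6 \cdot 3 - 1 \cdot 2}{\sqrt{(6 + 1)(6 + 2)(3 + 1)(3 + 2)}} = \frac{16}{\sqrt{1120}} \approx 0.478$$

Source and author: Danny Butunikik, The AI Vanguard newsletter.

Overall, MCC is a very useful metric for judging binary classification results, especially on imbalanced datasets or multi-class problems.

Loading Dataset

    df_sub = pd.read_csv("/kaggle/input/playground-series-s4e8/sample_submission.csv")
    df_train = pd.read_csv("/kaggle/input/playground-series-s4e8/train.csv")
    df_test = pd.read_csv("/kaggle/input/playground-series-s4e8/test.csv")

This snippet loads the competition dataset from Kaggle. It reads three CSV files: the sample submission, the training set, and the test set. Each statement and its purpose is explained in turn below.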
加载提交样本 (sample_submission.csv) df_sub pd.read_csv(/kaggle/input/playground-series-s4e8/sample-submission.csv) 这个语句使用 pandas 库的 read_csv 方法读取位于 /kaggle/input/playground-series-s4e8/ 目录下的 sample-submission.csv 文件并将其存储在一个名为 df_sub 的 DataFrame 中。通常情况下提交样本文件会包含一些示例ID和可能的默认预测值用于指导参赛者如何格式化他们的提交文件。 2. 加载训练集 (train.csv) df_train pd.read_csv(/kaggle/input/playground-series-s4e8/train.csv) 这条语句读取训练数据集并将其存储在一个名为 df_train 的 DataFrame 中。训练数据集通常包含特征输入变量和标签输出或目标变量用于训练机器学习模型。 3. 加载测试集 (test.csv) df_test pd.read_csv(/kaggle/input/playground-series-s4e8/test.csv) 这条语句读取测试数据集并将其存储在一个名为 df_test 的 DataFrame 中。测试数据集通常只包含特征而不包含标签因为它的目的是用来评估训练好的模型在未知数据上的表现。数据集的作用训练集 (train.csv)用于训练模型。测试集 (test.csv)用于验证模型的泛化能力即模型对未见过的数据的表现。提交样本 (sample_submission.csv)提供了提交格式的模板通常包含所有测试集的ID和一个占位符列用于填写预测值。注意事项确保路径正确并且文件存在。如果您是从Kaggle笔记本环境中运行代码那么路径通常是正确的因为Kaggle会为您创建相应的输入目录。如果是在本地环境中运行则需要确保下载了数据集并放置在正确的路径下。检查训练数据集中分类特征的分布情况 Checking Distribution of categorical features Checking Distribution of categorical features df_train.info() categorical_columns df_train.select_dtypes(include[object]).columns unique_values {col:df_train[col].nunique()for col in categorical_columns} for col,unique_count in unique_values.items(): print(f{col}:{unique_count}unique values) gc.collect() 您提供的代码段旨在检查训练数据集中分类特征的分布情况。下面是这些代码行的解释及修正检查数据集信息首先通过调用 df_train.info() 方法来获取关于数据集的一些基本信息如每列的数据类型、非空条目数等。 df_train.info() 获取分类特征列名接着选择数据类型为对象类型通常用于字符串的列这通常代表分类特征。 categorical_columns df_train.select_dtypes(include[object]).columns 计算每个分类特征的唯一值数量然后计算每个分类特征的唯一值数量并将结果存储在一个字典中。 unique_values {col: df_train[col].nunique() for col in categorical_columns} 打印每个分类特征的唯一值数量最后遍历 unique_values 字典并打印每个分类特征的名称以及其拥有的唯一值数量。 for col, unique_count in unique_values.items():print(f{col}: {unique_count} unique values) 垃圾回收执行 gc.collect() 是为了手动触发垃圾回收机制清理不再使用的内存空间。这一步骤在数据分析过程中并非总是必要但在处理非常大的数据集或者内存有限的情况下可能会有所帮助。 gc.collect() 完整代码段以下是整理后的完整代码段 import gc # 导入垃圾回收模块# 查看数据集的基本信息 df_train.info()# 获取分类特征列名 categorical_columns df_train.select_dtypes(include[object]).columns# 
计算每个分类特征的唯一值数量 unique_values {col: df_train[col].nunique() for col in categorical_columns}# 打印每个分类特征的唯一值数量 for col, unique_count in unique_values.items():print(f{col}: {unique_count} unique values)# 触发垃圾回收 gc.collect() categorical_columns df_test.select_dtypes(include[object]).columns unique_values {col:df_test[col].nunique()for col in categorical_columns} for col,unique_count in unique_values.items(): print(f{col}:{unique_count}unique values) gc.collect() df_test.columns df_train.columns 看起来您想检查测试数据集中分类特征的分布情况并且想要列出测试集和训练集的所有列名。以下是修正后的代码段检查测试集中分类特征的分布情况首先我们需要确保 categorical_columns 是从 df_test 而不是 df_train 中提取的分类特征列名。然后计算这些分类特征的唯一值数量并打印出来。 # 获取测试集中分类特征列名 categorical_columns df_test.select_dtypes(include[object]).columns# 计算每个分类特征的唯一值数量 unique_values {col: df_test[col].nunique() for col in categorical_columns}# 打印每个分类特征的唯一值数量 for col, unique_count in unique_values.items():print(f{col}: {unique_count} unique values)# 触发垃圾回收 gc.collect() 列出测试集和训练集的所有列名接下来我们列出 df_test 和 df_train 的所有列名。 # 列出测试集的所有列名 print(Test set columns:) print(df_test.columns)# 列出训练集的所有列名 print(\nTrain set columns:) print(df_train.columns) 整理后的完整代码以下是整理后的完整代码 import gc # 导入垃圾回收模块# 获取测试集中分类特征列名 categorical_columns df_test.select_dtypes(include[object]).columns# 计算每个分类特征的唯一值数量 unique_values {col: df_test[col].nunique() for col in categorical_columns}# 打印每个分类特征的唯一值数量 for col, unique_count in unique_values.items():print(f{col}: {unique_count} unique values)# 触发垃圾回收 gc.collect()# 列出测试集的所有列名 print(\nTest set columns:) print(df_test.columns)# 列出训练集的所有列名 print(\nTrain set columns:) print(df_train.columns) 这段代码将首先输出测试集中每个分类特征的唯一值数量然后分别输出测试集和训练集的所有列名。这样可以帮助您了解数据集的结构和分类特征的分布情况。如果您的数据集中有其他类型的分类特征比如编码为整型或其他类型的分类数据则需要调整 select_dtypes 的参数以适应实际情况。 missing_train df_train.isna().mean() * 100 missing_test df_test.isna().mean() * 100print(Columns in df_train with more than 10% missing values:) print(missing_train[missing_train 0])print(\nColumns in df_test with more than 10% missing values:) 
print(missing_test[missing_test 0]) 进行探索性数据分析Exploratory Data Analysis简称 EDA是为了更好地理解数据集的特点、发现模式、检测异常值以及检验假设。使用 seaborn 这样的可视化库可以极大地帮助我们进行这一过程。下面是一个基本的 EDA 示例展示了如何使用 seaborn 库来可视化和探索数据集中的特征。我们将从以下几个方面进行探索查看数据的基本统计信息了解数据集的规模、缺失值情况等。可视化数值特征的分布了解数值特征的分布情况。分析分类特征与目标变量的关系使用条形图等来展示不同类别的频率。数值特征与目标变量的关系使用箱线图或小提琴图来观察数值特征与目标之间的关系。特征间的相关性使用热力图来显示数值特征之间的相关性。示例代码假设您已经有了 df_train 和 df_test 数据集我们可以开始进行 EDA import pandas as pd import numpy as np import seaborn as sns import matplotlib.pyplot as plt import gc# 如果没有已经加载数据集这里可以加载 # df_train pd.read_csv(/path/to/train.csv) # df_test pd.read_csv(/path/to/test.csv)# 查看数据集的基本信息 print(Training dataset info:) df_train.info() print(\nTesting dataset info:) df_test.info()# 获取分类特征列名 categorical_columns df_train.select_dtypes(include[object]).columns# 获取数值特征列名 numerical_columns df_train.select_dtypes(include[int64, float64]).columns# 绘制分类特征的分布 for col in categorical_columns:plt.figure()sns.countplot(xcol, datadf_train)plt.title(fDistribution of {col})plt.show()# 绘制数值特征的分布 for col in numerical_columns:plt.figure()sns.histplot(datadf_train, xcol, kdeTrue)plt.title(fDistribution of {col})plt.show()# 如果有目标变量绘制目标变量的分布 if target in df_train.columns:plt.figure()sns.countplot(xtarget, datadf_train)plt.title(Distribution of Target)plt.show()# 分析数值特征与目标变量的关系 if target in df_train.columns:for col in numerical_columns:plt.figure()sns.boxplot(xtarget, ycol, datadf_train)plt.title(fRelationship between {col} and Target)plt.show()# 特征间的相关性 if len(numerical_columns) 1:plt.figure(figsize(10, 8))correlation_matrix df_train[numerical_columns].corr()sns.heatmap(correlation_matrix, annotTrue, cmapcoolwarm)plt.title(Correlation Matrix)plt.show()# 触发垃圾回收 gc.collect() 代码说明基本信息使用 info() 方法来查看数据集的概览信息。分类特征分布使用 countplot 来绘制每个分类特征的频数分布。数值特征分布使用 histplot 来绘制数值特征的直方图并加上核密度估计KDE曲线。目标变量分布如果有目标变量使用 countplot 绘制其分布。数值特征与目标变量的关系使用 boxplot 来观察数值特征与目标变量之间的关系。特征间的相关性使用 heatmap 来展示数值特征之间的相关性矩阵。要检查训练集和测试集中各列的缺失值比例并找出缺失值超过10%的列可以使用如下代码 # 计算训练集中各列的缺失值比例 missing_train 
df_train.isna().mean() * 100# 计算测试集中各列的缺失值比例 missing_test df_test.isna().mean() * 100# 打印训练集中缺失值比例超过10%的列 print(Columns in df_train with more than 10% missing values:) print(missing_train[missing_train 10])# 打印测试集中缺失值比例超过10%的列 print(\nColumns in df_test with more than 10% missing values:) print(missing_test[missing_test 10]) 解释计算缺失值比例 df_train.isna() 和 df_test.isna() 返回一个布尔DataFrame其中True表示该位置的值为缺失值。mean() 方法计算每一列的平均值也就是该列中缺失值的比例。乘以 100 将比例转换成百分比形式。筛选缺失值超过10%的列使用条件 missing_train 10 和 missing_test 10 来筛选出缺失值比例大于10%的列。print 函数输出符合条件的列及其缺失值比例。示例输出假设 df_train 和 df_test 中存在一些缺失值代码执行后可能会输出类似以下内容 Columns in df_train with more than 10% missing values: feature_a 15.234 feature_b 12.765 dtype: float64Columns in df_test with more than 10% missing values: feature_a 14.567 feature_c 11.345 dtype: float64 这意味着在训练集中feature_a 和 feature_b 的缺失值比例超过了10%而在测试集中feature_a 和 feature_c 的缺失值比例超过了10%。处理缺失值的方法一旦识别出缺失值较多的列可以考虑以下几种方法来处理缺失值删除含有缺失值的行或列如果某一行或某一列的缺失值比例非常高可以考虑删除这些行或列。填充缺失值可以使用均值、中位数或众数来填充数值型特征的缺失值对于分类特征可以考虑使用众数或特定类别来填充。使用插值方法对于时间序列数据可以使用前向填充ffill或后向填充bfill。构建预测模型可以使用其他特征来预测缺失值。使用特定标志可以创建一个新的特征来指示某个值是否缺失。各特征的缺失值比例您希望绘制一个条形图来展示训练数据集中各特征的缺失值比例并按缺失值比例从高到低排序。以下是修正后的代码用于实现这一目标 import seaborn as sns import matplotlib.pyplot as plt# 计算训练集中各列的缺失值比例 missing_values df_train.isnull().mean() * 100# 筛选出具有缺失值的特征 missing_values missing_values[missing_values 0]# 对缺失值比例进行降序排序 missing_values missing_values.sort_values(ascendingFalse)# 绘制条形图 plt.figure(figsize(10, 6)) sns.barplot(xmissing_values.index, ymissing_values.values, paletteviridis) plt.xticks(rotation90) plt.xlabel(Features) plt.ylabel(Percentage of Missing Values) plt.title(Missing Values Distribution in df_train) plt.show() 代码解释计算缺失值比例 missing_values df_train.isnull().mean() * 100 这一行代码计算了 df_train 中每一列的缺失值比例并将其转换为百分比形式。筛选具有缺失值的特征 missing_values missing_values[missing_values 0] 这一行代码将缺失值比例为0的特征排除在外只保留有缺失值的特征。按缺失值比例排序 missing_values missing_values.sort_values(ascendingFalse) 这一行代码将特征按照缺失值比例从高到低排序。绘制条形图 plt.figure(figsize(10, 6)) 
sns.barplot(xmissing_values.index, ymissing_values.values, paletteviridis) plt.xticks(rotation90) plt.xlabel(Features) plt.ylabel(Percentage of Missing Values) plt.title(Missing Values Distribution in df_train) plt.show() 这几行代码使用 seaborn 库的 barplot 方法绘制条形图展示每个特征的缺失值比例并设置了图表的样式和标题。运行结果执行上述代码后您将得到一个条形图其中 X轴表示特征名称。Y轴表示缺失值的比例以百分比形式。条形图的颜色使用 viridis 调色板。此外X轴标签旋转了90度以便更好地显示特征名称。 !pip install dython from sklearn.model_selection import train_test_split from sklearn.preprocessing import LabelEncoder,OrdinalEncoder import category_encoders as ce missing_threshold 0.95 high_missing_columns df_train.columns[df_train.isnull().mean()missing_threshold] df_train df_train.drop(columnshigh_missing_columns) df_test df_test.drop(columnshigh_missing_columns) target class for column in df_train.columns: if df_train[column].isnull().any(): if df_train[column].dtype object: mode_value df_train[column].mode()[0] df_train[column].fillna(mode_value,inplaceTrue) df_test [column].fillna(mode_value,inplaceTrue) else: median_value df_train[column].median() df_train[column].fillna(median_value,inplaceTrue) df_test [column].fillna(median_value,inplaceTrue) ]: from dython.nominal import associations associations_df associations(df_train[:10000],nominal_columnsall,plotFalse) corr_matrix associations_df[corr plt.figure(figsize(20,8)) plt.gcf().set_facecolor(#FFFDD0) sns.heatmap(corr_matrix,annotTrue,fmt.2f,cmapcoolwarm,linewidths0.5) plt.title(Correlation Matrix including Categorical Features) plt.show() ]: import plotly.express as px df_train1 df_train[:10000].copy() feature_counts df_train1.groupby([cap-shape,cap-color])size().reset_index(namecount fig px.sunburst(feature_counts,path[cap-shape,cap-color]valuescount, colorcount,color_continuous_scaleViridis, titleSunburst Chart of Cap Shape and Cap Color Distribution fig.update_layout(title_textSunburst Chart of Cap Shape and Cap Color Distribution, title_x0.5,width900,height600) fig.show() import plotly.graph_objects as goflow_data 
df_train1.groupby([cap-shape, cap-color]).size().reset_index(namecount) labels list(pd.concat([flow_data[cap-shape], flow_data[cap-color]]).unique()) label_map {label: idx for idx, label in enumerate(labels)}sources flow_data[cap-shape].map(label_map).tolist() targets flow_data[cap-color].map(label_map).tolist() values flow_data[count].tolist()fig go.Figure(data[go.Sankey(nodedict(pad15,thickness20,linedict(colorblack, width0.5),labellabels),linkdict(sourcesources,targettargets,valuevalues) )])fig.update_layout(title_textSankey Chart of Cap Shape to Cap Color Flow,title_x0.5, width1000, height600 ) fig.show() feature_counts df_train1.groupby([cap-shape, cap-color]).size().reset_index(namecount) fig px.bar(feature_counts, xcap-shape, ycount, colorcap-color,titleCrosstab Chart of Cap Shape and Cap Color,labels{cap-shape: Cap Shape, count: Count, cap-color: Cap Color},color_discrete_sequencepx.colors.qualitative.Plotly,textcount) fig.update_layout(title_textCrosstab Chart of Cap Shape and Cap Color,title_x0.5, xaxis_titleCap Shape,yaxis_titleCount,barmodestack )fig.show() cols_to_drop_train missing_train[missing_train 95].index cols_to_drop_test missing_test[missing_test 95].indexdf_train df_train.drop(columnscols_to_drop_train) df_test df_test.drop(columnscols_to_drop_test) gc.collect() 您的代码片段包含了一些拼写错误和语法问题。下面是修正后的代码包含了处理缺失值、计算相关性矩阵、绘制太阳图sunburst chart、桑基图sankey diagram和柱状图bar chart的功能并删除了缺失值比例过高的列。完整的代码 # 安装 dython 包 !pip install dython# 导入必要的库 from sklearn.model_selection import train_test_split from sklearn.preprocessing import LabelEncoder, OrdinalEncoder import category_encoders as ce import pandas as pd import seaborn as sns import matplotlib.pyplot as plt import dython import plotly.express as px import plotly.graph_objects as go import gc# 设置缺失值的阈值 missing_threshold 0.95# 获取缺失值比例超过阈值的列 high_missing_columns df_train.columns[df_train.isnull().mean() missing_threshold]# 删除缺失值比例过高的列 df_train df_train.drop(columnshigh_missing_columns) df_test 
df_test.drop(columnshigh_missing_columns)# 设定目标变量 target class# 遍历 DataFrame 中的每一列并处理缺失值 for column in df_train.columns:if df_train[column].isnull().any(): # 如果该列存在缺失值if df_train[column].dtype object: # 如果是类别型数据mode_value df_train[column].mode()[0] # 使用众数填充df_train[column].fillna(mode_value, inplaceTrue)df_test[column].fillna(mode_value, inplaceTrue)else: # 如果是数值型数据median_value df_train[column].median() # 使用中位数填充df_train[column].fillna(median_value, inplaceTrue)df_test[column].fillna(median_value, inplaceTrue)# 计算相关性矩阵 from dython.nominal import associations associations_df associations(df_train[:10000], nominal_columnsall, plotFalse) corr_matrix associations_df[corr]# 绘制相关性矩阵热力图 plt.figure(figsize(20, 8)) plt.gcf().set_facecolor(#FFFDD0) sns.heatmap(corr_matrix, annotTrue, fmt.2f, cmapcoolwarm, linewidths0.5) plt.title(Correlation Matrix including Categorical Features) plt.show()# 绘制太阳图 df_train1 df_train[:10000].copy() feature_counts df_train1.groupby([cap-shape, cap-color]).size().reset_index(namecount)fig px.sunburst(feature_counts,path[cap-shape, cap-color],valuescount,colorcount,color_continuous_scaleViridis,titleSunburst Chart of Cap Shape and Cap Color Distribution ) fig.update_layout(title_textSunburst Chart of Cap Shape and Cap Color Distribution,title_x0.5,width900,height600 ) fig.show()# 绘制桑基图 flow_data df_train1.groupby([cap-shape, cap-color]).size().reset_index(namecount) labels list(pd.concat([flow_data[cap-shape], flow_data[cap-color]]).unique()) label_map {label: idx for idx, label in enumerate(labels)}sources flow_data[cap-shape].map(label_map).tolist() targets flow_data[cap-color].map(label_map).tolist() values flow_data[count].tolist()fig go.Figure(data[go.Sankey(nodedict(pad15,thickness20,linedict(colorblack, width0.5),labellabels),linkdict(sourcesources,targettargets,valuevalues) )])fig.update_layout(title_textSankey Chart of Cap Shape to Cap Color Flow,title_x0.5, width1000, height600 ) fig.show()# 绘制柱状图 feature_counts 
df_train1.groupby([cap-shape, cap-color]).size().reset_index(namecount) fig px.bar(feature_counts,xcap-shape,ycount,colorcap-color,titleCrosstab Chart of Cap Shape and Cap Color,labels{cap-shape: Cap Shape, count: Count, cap-color: Cap Color},color_discrete_sequencepx.colors.qualitative.Plotly,textcount ) fig.update_layout(title_textCrosstab Chart of Cap Shape and Cap Color,title_x0.5, xaxis_titleCap Shape,yaxis_titleCount,barmodestack ) fig.show()# 删除缺失值比例过高的列 cols_to_drop_train high_missing_columns cols_to_drop_test high_missing_columnsdf_train df_train.drop(columnscols_to_drop_train) df_test df_test.drop(columnscols_to_drop_test)# 触发垃圾回收 gc.collect() 代码解释安装 dython 包 !pip install dython 导入必要的库 from sklearn.model_selection import train_test_split from sklearn.preprocessing import LabelEncoder, OrdinalEncoder import category_encoders as ce import pandas as pd import seaborn as sns import matplotlib.pyplot as plt import dython import plotly.express as px import plotly.graph_objects as go import gc 设置缺失值的阈值并删除缺失值比例过高的列 missing_threshold 0.95 high_missing_columns df_train.columns[df_train.isnull().mean() missing_threshold] df_train df_train.drop(columnshigh_missing_columns) df_test df_test.drop(columnshigh_missing_columns) 处理剩余的缺失值 target class for column in df_train.columns:if df_train[column].isnull().any(): if df_train[column].dtype object: mode_value df_train[column].mode()[0] df_train[column].fillna(mode_value, inplaceTrue)df_test[column].fillna(mode_value, inplaceTrue)else: median_value df_train[column].median() df_train[column].fillna(median_value, inplaceTrue)df_test[column].fillna(median_value, inplaceTrue) 计算相关性矩阵并绘制热力图 associations_df associations(df_train[:10000], nominal_columnsall, plotFalse) corr_matrix associations_df[corr] plt.figure(figsize(20, 8)) plt.gcf().set_facecolor(#FFFDD0) sns.heatmap(corr_matrix, annotTrue, fmt.2f, cmapcoolwarm, linewidths0.5) plt.title(Correlation Matrix including Categorical Features) plt.show() 绘制太阳图 df_train1 
df_train[:10000].copy() feature_counts df_train1.groupby([cap-shape, cap-color]).size().reset_index(namecount) fig px.sunburst(feature_counts, path[cap-shape, cap-color], valuescount, colorcount, color_continuous_scaleViridis, titleSunburst Chart of Cap Shape and Cap Color Distribution) fig.update_layout(title_textSunburst Chart of Cap Shape and Cap Color Distribution, title_x0.5, width900, height600) fig.show() 绘制桑基图 flow_data df_train1.groupby([cap-shape, cap-color]).size().reset_index(namecount) labels list(pd.concat([flow_data[cap-shape], flow_data[cap-color]]).unique()) label_map {label: idx for idx, label in enumerate(labels)} sources flow_data[cap-shape].map(label_map).tolist() targets flow_data[cap-color].map(label_map).tolist() values flow_data[count].tolist() fig go.Figure(data[go.Sankey(nodedict(pad15, thickness20, linedict(colorblack, width0.5), labellabels), linkdict(sourcesources, targettargets, valuevalues))]) fig.update_layout(title_textSankey Chart of Cap Shape to Cap Color Flow, title_x0.5, width1000, height600) fig.show() 绘制柱状图 feature_counts df_train1.groupby([cap-shape, cap-color]).size().reset_index(namecount) fig px.bar(feature_counts, xcap-shape, ycount, colorcap-color, titleCrosstab Chart of Cap Shape and Cap Color, labels{cap-shape: Cap Shape, count: Count, cap-color: Cap Color}, color_discrete_sequencepx.colors.qualitative.Plotly, textcount) fig.update_layout(title_textCrosstab Chart of Cap Shape and Cap Color, title_x0.5, xaxis_titleCap Shape, yaxis_titleCount, barmodestack) fig.show() 再次删除缺失值比例过高的列 cols_to_drop_train high_missing_columns cols_to_drop_test high_missing_columns df_train df_train.drop(columnscols_to_drop_train) df_test df_test.drop(columnscols_to_drop_test) 触发垃圾回收 gc.collect() 四种常见的分类特征编码方法及其优缺点以及适用场景这张图片展示了一张表格其中总结了四种常见的分类特征编码方法及其优缺点以及适用场景 Label Encoding Advantages: 简单易用适用于有序数据。Disadvantages: 编码顺序可能具有任意性可能会引入偏见。Use Cases: 适用于有序数据和基于树的算法。 One-Hot Encoding Advantages: 可防止序数偏差适用于名义数据。Disadvantages: 增加维度稀疏表示。Use Cases: 
适用于名义数据、线性和距离度量为基础的算法。 Binary Encoding Advantages: 减少维度适合高基数的数据集。Disadvantages: 需要额外处理可能导致序数偏差。Use Cases: 适用于大样本集和基于树的算法。 Target Encoding Advantages: 根据目标变量进行编码捕捉复杂的关系。Disadvantages: 易于过拟合和泄露需要目标变量。Use Cases: 当变量与目标之间有强烈关系时用于监督学习任务。让我们更深入地了解这些编码技术 Label Encoding: 这种方法为每个类标签分配一个唯一的数字。例如如果类别A、B和C分别映射到1、2和3则类别之间的相对顺序可能会被误解。对于某些算法如决策树这可能是合适的因为它们可以理解这种顺序。然而在其他情况下这种方法可能会导致模型对数据中的无意义顺序产生依赖。One-Hot Encoding: 这是一种将离散特征转换为多个二进制特征的技术。它通过创建一个新特征来表示每个可能的类别然后将对应类别的值设为1其余类别的值设为0。这种方法避免了序数偏差因为它不会在类别之间引入任何排序信息。然而它会导致维度爆炸特别是在类别数量很大的情况下。Binary Encoding: 这种方法使用较少的二进制特征来表示原始类别。例如如果有四个类别我们可以使用两个二进制特征来表示它们。这种方法减少了维度但它需要额外的预处理步骤来构建和解码二进制表示。此外它可能会引入序数偏差因为不同的二进制组合可能看起来像是具有某种顺序。Target Encoding: 这种方法根据目标变量的分布来替换类别。例如类别A可以被其平均目标值所取代。这种方法能够捕获复杂的类别间关系但容易出现过拟合和数据泄漏问题。为了减少这些问题通常会使用平滑或噪声添加等技术。由于它依赖于目标变量因此仅适用于监督学习任务。选择哪种编码方法取决于具体的应用场景和数据特性。例如如果数据集具有大量类别且内存有限那么二进制编码可能是最佳选择如果类别之间没有自然顺序并且希望最小化维度增加那么one-hot编码可能是更好的选择。 from sklearn.impute import KNNImputer import pandas as pddef knn_impute(df, n_neighbors5): df_encoded df.copy()for col in df_encoded.select_dtypes(includeobject).columns:df_encoded[col] df_encoded[col].astype(category).cat.codesknn_imputer KNNImputer(n_neighborsn_neighbors)df_imputed pd.DataFrame(knn_imputer.fit_transform(df_encoded), columnsdf_encoded.columns)for col in df.select_dtypes(includeobject).columns:df_imputed[col] df_imputed[col].round().astype(int).map(dict(enumerate(df[col].astype(category).cat.categories)))return df_imputeddf_train_imputed knn_impute(df_train, n_neighbors5) df_test_imputed knn_impute(df_test, n_neighbors5)cat_cols_train df_train_imputed.select_dtypes(include[object]).columns cat_cols_train cat_cols_train[cat_cols_train ! 
class] ordinal_encoder OrdinalEncoder(handle_unknownuse_encoded_value, unknown_value-1)df_train_imputed[cat_cols_train] ordinal_encoder.fit_transform(df_train_imputed[cat_cols_train].astype(str)) df_test_imputed[cat_cols_train] ordinal_encoder.transform(df_test_imputed[cat_cols_train].astype(str)) df_train_imputed.head() df_test_imputed.head() df_train df_train_imputed df_test df_test_imputed df_test.head() le LabelEncoder() df_train[class] le.fit_transform(df_train[class]) 您提供的代码实现了使用 KNNK-Nearest Neighbors方法对数据集中的缺失值进行填充并对分类特征进行了编码。下面是代码的解释和修正代码解释定义 KNN Imputation 函数使用 KNNImputer 对数值型和经过编码的分类特征进行缺失值填充。对分类特征进行编码并在填充后恢复为原来的类别标签。应用 KNN Imputation 对训练集和测试集应用 KNN Imputation。对分类特征进行 Ordinal Encoding 使用 OrdinalEncoder 对分类特征进行编码。对目标变量进行 Label Encoding 使用 LabelEncoder 对目标变量进行编码。修正后的代码以下是修正后的代码 from sklearn.impute import KNNImputer from sklearn.preprocessing import OrdinalEncoder, LabelEncoder import pandas as pddef knn_impute(df, n_neighbors5):df_encoded df.copy()# 将分类特征转换为编码for col in df_encoded.select_dtypes(includeobject).columns:df_encoded[col] df_encoded[col].astype(category).cat.codes# 创建 KNN Imputer 并进行填充knn_imputer KNNImputer(n_neighborsn_neighbors)df_imputed pd.DataFrame(knn_imputer.fit_transform(df_encoded), columnsdf_encoded.columns)# 将编码后的分类特征恢复为原始类别for col in df.select_dtypes(includeobject).columns:df_imputed[col] df_imputed[col].round().astype(int).map(dict(enumerate(df[col].astype(category).cat.categories)))return df_imputed# 应用 KNN Imputation df_train_imputed knn_impute(df_train, n_neighbors5) df_test_imputed knn_impute(df_test, n_neighbors5)# 查看填充后的训练集和测试集 print(Imputed Training Set Head:) print(df_train_imputed.head()) print(\nImputed Testing Set Head:) print(df_test_imputed.head())# 获取分类特征列名 cat_cols_train df_train_imputed.select_dtypes(include[object]).columns cat_cols_train cat_cols_train[cat_cols_train ! 
class]# 使用 Ordinal Encoder 对分类特征进行编码 ordinal_encoder OrdinalEncoder(handle_unknownuse_encoded_value, unknown_value-1) df_train_imputed[cat_cols_train] ordinal_encoder.fit_transform(df_train_imputed[cat_cols_train].astype(str)) df_test_imputed[cat_cols_train] ordinal_encoder.transform(df_test_imputed[cat_cols_train].astype(str))# 查看编码后的训练集和测试集 print(\nEncoded Training Set Head:) print(df_train_imputed.head()) print(\nEncoded Testing Set Head:) print(df_test_imputed.head())# 更新原始 DataFrame df_train df_train_imputed df_test df_test_imputed# 对目标变量进行 Label Encoding le LabelEncoder() df_train[class] le.fit_transform(df_train[class])# 查看最终的训练集 print(\nFinal Training Set Head:) print(df_train.head()) 代码说明定义 KNN Imputation 函数将分类特征转换为编码。使用 KNNImputer 对缺失值进行填充。将编码后的分类特征恢复为原始类别。应用 KNN Imputation 对训练集和测试集应用 KNN Imputation。对分类特征进行 Ordinal Encoding 使用 OrdinalEncoder 对分类特征进行编码。对目标变量进行 Label Encoding 使用 LabelEncoder 对目标变量进行编码。输出运行以上代码后您将看到填充后的训练集和测试集头部信息。编码后的训练集和测试集头部信息。最终的训练集头部信息包括对目标变量进行编码后的结果。请确保在运行此代码之前已经加载了 df_train 和 df_test 数据集并且这些数据集中存在分类特征和目标变量。如果您的数据集中没有名为 class 的列请替换为您实际的目标列名。 Observation from above plot ]: y df_train[class] X df_train.drop([class],axis1) 您提到的“above plot”指的是之前生成的相关性矩阵热力图、太阳图、桑基图和柱状图。虽然我没有直接看到这些图表的具体内容但我可以根据您提供的代码和描述来进行一些通用的观察。假设我们已经生成了相关性矩阵热力图、太阳图、桑基图和柱状图并且现在您想要从这些图表中得出一些观察结论。以下是针对这些图表的一些可能的观察和结论相关性矩阵热力图观察点相关性矩阵显示了各个特征之间的相关性程度。热力图中的颜色深浅代表了相关性的强度正值表示正相关负值表示负相关。可能的结论特征间相关性哪些特征之间存在较高的正相关性或负相关性这有助于了解特征之间的依赖关系。特征重要性哪些特征与目标变量class的相关性较高这些特征可能是重要的预测变量。多重共线性是否存在高度相关的特征对这可能表明存在多重共线性问题需要在建模时加以注意。太阳图Sunburst Chart 观察点太阳图展示了不同特征组合下的分布情况。不同的层级代表不同的特征值组合。可能的结论特征组合分布哪些特征组合的分布较为均匀哪些特征组合的分布较为集中特征间交互作用不同特征之间的组合是否表现出明显的交互作用桑基图Sankey Diagram 观察点桑基图展示了特征值之间的流向。不同节点的大小代表了对应的特征值出现的频率。可能的结论特征值流向哪些特征值之间的流向最为显著这有助于了解特征之间的相互影响。特征值频率哪些特征值出现得最频繁这反映了数据集中的主要特征值分布情况。柱状图Bar Chart 观察点柱状图展示了不同特征值的频数分布。不同颜色的柱子代表不同的类别。可能的结论特征值频数哪些特征值的频数较高这反映了数据集中最常见的特征值。特征分布不同特征值之间的分布是否均匀是否存在某些特征值显著高于其他特征值的情况数据准备接下来您准备将数据集分为特征矩阵 X 和目标向量 y以便进行建模。以下是相应的代码 y df_train[class] X df_train.drop([class], axis1) 解释 y 
是目标变量class即您希望预测的列。X 是特征矩阵即除了目标变量之外的所有其他特征组成的 DataFrame。 v您使用的 train_test_split 函数是从 sklearn.model_selection 模块中导入的用于将数据集划分为训练集和测试集。以下是对您提供的代码的解释以及如何正确使用它代码解释 from sklearn.model_selection import train_test_split# 假设 X 和 y 已经按照之前的步骤定义好了 train_X, test_X, train_y, test_y train_test_split(X, y, test_size0.2, random_state42, stratifyy) 参数说明 X: 特征矩阵包含除目标变量外的所有特征。y: 目标向量即您要预测的变量。test_size: 测试集占总样本的比例默认是 0.25这里设置为 0.2意味着测试集占总样本的 20%。random_state: 随机种子保证每次划分得到相同的结果便于复现实验。stratifyy: 这个参数用于保持训练集和测试集中的目标变量分布一致。对于分类任务来说这可以确保训练集和测试集中的各类别比例相等从而避免偏差。代码执行这段代码会将特征矩阵 X 和目标向量 y 划分为训练集 (train_X, train_y) 和测试集 (test_X, test_y)。stratifyy 参数保证了训练集和测试集的目标变量分布一致这对于分类任务尤其重要因为它可以防止训练集和测试集中的类别不平衡。示例完整代码为了确保这段代码能够正常运行下面是一个完整的示例假设 X 和 y 已经定义好 # 导入必要的模块 from sklearn.model_selection import train_test_split# 假设 X 和 y 已经按照之前的步骤定义好了 # y df_train[class] # X df_train.drop([class], axis1)# 划分数据集 train_X, test_X, train_y, test_y train_test_split(X, y, test_size0.2, random_state42, stratifyy)# 输出划分后的数据集形状 print(Train X Shape:, train_X.shape) print(Test X Shape:, test_X.shape) print(Train y Shape:, train_y.shape) print(Test y Shape:, test_y.shape) 输出当您运行上述代码时您应该会看到如下输出 Train X Shape: (number_of_rows_in_train_set, number_of_features) Test X Shape: (number_of_rows_in_test_set, number_of_features) Train y Shape: (number_of_rows_in_train_set,) Test y Shape: (number_of_rows_in_test_set,) 这里的 (number_of_rows_in_train_set) 和 (number_of_rows_in_test_set) 分别是训练集和测试集中样本的数量而 (number_of_features) 是特征的数量 Classification Analysis 看起来您想进行分类分析但是输入的内容有些模糊。分类分析是一种统计方法用于预测类别变量的类别。在机器学习中分类任务涉及训练模型来预测给定输入数据的类别标签。以下是进行分类分析的一般步骤数据准备加载数据集。清洗和预处理数据如填补缺失值、编码分类变量等。划分数据集为训练集和测试集。特征工程选择合适的特征。可能需要进行特征缩放或归一化。选择模型根据问题的特点选择合适的分类算法如逻辑回归、决策树、随机森林、支持向量机SVM、神经网络等。训练模型使用训练集数据训练选定的模型。评估模型在测试集上评估模型的性能。使用各种指标如准确率Accuracy、精确度Precision、召回率Recall、F1 分数F1 Score、混淆矩阵Confusion Matrix等。调整参数如果模型性能不佳可以通过调整模型参数或使用交叉验证Cross Validation进行超参数调优。最终评估使用最佳参数重新训练模型。在独立的验证集或完全未见过的数据上进行最终评估。示例代码假设您已经完成了数据预处理并且已经划分了训练集和测试集 (train_X, test_X, train_y, 
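test_y)

Below is an example of classification with a random forest classifier:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

    # Initialize the classifier
    classifier = RandomForestClassifier(random_state=42)

    # Train the model
    classifier.fit(train_X, train_y)

    # Predict on the test set
    predictions = classifier.predict(test_X)

    # Print the classification report
    print("Classification Report:\n", classification_report(test_y, predictions))

    # Compute and print the confusion matrix
    print("Confusion Matrix:\n", confusion_matrix(test_y, predictions))

    # Compute and print the accuracy
    print("Accuracy Score:", accuracy_score(test_y, predictions))

Explanation

Initialize the classifier: a random forest classifier is used here, with a fixed random seed so that results are reproducible.
Train the model: the model is fitted on the training data (train_X, train_y).
Predict on the test set: the trained model predicts labels for test_X.
Evaluate the model: the classification report, confusion matrix, and accuracy are printed to assess performance.

Interpreting the results

Classification Report: precision, recall, and F1 score for each class, plus averages across classes.
Confusion Matrix: counts of samples for each combination of true label and predicted label.
Accuracy Score: the fraction of samples the model classified correctly.

XGBoost binary classification

    from sklearn.metrics import matthews_corrcoef

    def mcc_metric(y_pred, dmatrix):
        y_true = dmatrix.get_label()
        y_pred = (y_pred > 0.5).astype(int)
        mcc = matthews_corrcoef(y_true, y_pred)
        return "mcc", mcc

    from sklearn.metrics import matthews_corrcoef
    from xgboost import XGBClassifier

    model = XGBClassifier(
        alpha=0.1,
        subsample=0.8,
        colsample_bytree=0.6,
        objective="binary:logistic",
        max_depth=14,
        min_child_weight=7,
        gamma=1e-6,
        # random_state=42,
        n_estimators=100)

    XGB = model.fit(train_X, train_y, eval_set=[(test_X, test_y)], eval_metric=mcc_metric)
    y_pred = XGB.predict(test_X)

    import lime
    import lime.lime_tabular

    redict_fn_xgb = lambda x: XGB.predict_proba(x).astype(float)
    X = train_X.values
    explainer = lime.lime_tabular.LimeTabularExplainer(
        X, feature_names=train_X.columns, class_names=["Poisnous", "edible"], kernel_width=5)

    df_test.head(4)
    df_test.loc[[3]]
    test_X

    choosen_instance = test_X.loc[[1584520]].values[0]
    exp = explainer.explain_instance(choosen_instance, redict_fn_xgb, num_features=15)
    exp.show_in_notebook(show_all=False)

    choosen_instance = test_X.loc[[2244255]].values[0]
    exp = explainer.explain_instance(choosen_instance, redict_fn_xgb, num_features=15)
    exp.show_in_notebook(show_all=False)

    choosen_instance = test_X.loc[[421615]].values[0]
    exp = explainer.explain_instance(choosen_instance, redict_fn_xgb, num_features=15)
    exp.show_in_notebook(show_all=False)

    choosen_instance = test_X.loc[[2921070]].values[0]
    exp = explainer.explain_instance(choosen_instance, redict_fn_xgb, num_features=15)
    exp.show_in_notebook(show_all=False)

    choosen_instance = test_X.loc[[2921070]].values[0]
    exp = explainer.explain_instance(choosen_instance, redict_fn_xgb, num_features=15)
    exp.show_in_notebook(show_all=False)

    score = matthews_corrcoef(test_y, y_pred)
    print("MCC:", score)

    test_pred_prob = XGB.predict(df_test)
    test_pred_prob
    # test_pred_binary = (test_pred_prob > 0.5).astype(int)
    test_pred_class = le.inverse_transform(test_pred_prob)
    df_sub["class"] = test_pred_class

The code you provided uses XGBoost for a binary classification task with the Matthews correlation coefficient (MCC) as the evaluation metric. It also uses LIME (Local Interpretable Model-agnostic Explanations) to explain individual predictions. An explanation and a corrected version follow.

Code explanation

Define the MCC metric: a custom function, mcc_metric, computes the Matthews correlation coefficient.
Initialize the XGBoost model: set the hyperparameters and instantiate the XGBoost classifier.
Train the model: fit on the training data and evaluate with the custom MCC metric during training.
Predict on the test set: use the trained model to predict test_X.
Explain predictions with LIME: define a prediction function for LIME (named redict_fn_xgb in the original, a typo for predict_fn_xgb), initialize the explainer with LimeTabularExplainer, and call explain_instance on selected instances.
Evaluate the model: compute the MCC on the test set.
Generate predictions: predict on df_test and convert the output back to the original class labels.

Corrected code

    from sklearn.metrics import matthews_corrcoef
    from xgboost import XGBClassifier
    import lime
    import lime.lime_tabular

    # Custom MCC evaluation function
    def mcc_metric(y_pred, dmatrix):
        y_true = dmatrix.get_label()
        y_pred = (y_pred > 0.5).astype(int)
        mcc = matthews_corrcoef(y_true, y_pred)
        return "mcc", mcc

    # Initialize the XGBoost model
    model = XGBClassifier(
        alpha=0.1,
        subsample=0.8,
        colsample_bytree=0.6,
        objective="binary:logistic",
        max_depth=14,
        min_child_weight=7,
        gamma=1e-6,
        n_estimators=100
    )

    # Train the model.
    # Note: passing eval_metric to fit() with this (y_pred, dmatrix) signature
    # is the older XGBoost interface; recent releases expect eval_metric in the
    # XGBClassifier constructor, so check the docs for your installed version.
    XGB = model.fit(
        train_X, train_y,
        eval_set=[(test_X, test_y)],
        eval_metric=mcc_metric
    )

    # Predict on the test set
    y_pred = XGB.predict(test_X)

    # Explain predictions with LIME
    predict_fn_xgb = lambda x: XGB.predict_proba(x).astype(float)
    X = train_X.values
    explainer = lime.lime_tabular.LimeTabularExplainer(
        X,
        feature_names=train_X.columns,
        class_names=["Poisonous", "Edible"],
        kernel_width=5
    )

    # Select instances and explain each one
    choosen_instances = [1584520, 2244255, 421615, 2921070]
    for instance_id in choosen_instances:
        choosen_instance = test_X.loc[[instance_id]].values[0]
        exp = explainer.explain_instance(choosen_instance, predict_fn_xgb, num_features=15)
        exp.show_in_notebook(show_all=False)

    # Compute the MCC
    score = matthews_corrcoef(test_y, y_pred)
    print("MCC:", score)

    # Predicted probability of the positive class
    test_pred_prob = XGB.predict_proba(df_test)[:, 1]

    # Convert probabilities to the original class labels
    test_pred_class = le.inverse_transform((test_pred_prob > 0.5).astype(int))

    # Store the predictions in the submission DataFrame
    df_sub["class"] = test_pred_class

Notes

Prediction function: predict_fn_xgb returns the full class-probability matrix from predict_proba, which is what LIME expects. Only when building the submission is the second column (the probability of the positive class) selected.
Class names: class_names should be the real class names in your dataset, listed in the same order as the encoded labels (class 0 first). Poisonous and Edible are assumed here.
Instance IDs: make sure every ID in choosen_instances is valid, that is, present in the index of test_X.
Kernel width: kernel_width controls how local the explainer's neighborhood is.
Prediction conversion: a threshold of 0.5 is used to turn predicted probabilities into class labels.
Displaying results: LIME output is usually viewed in a Jupyter Notebook, so exp.show_in_notebook(show_all=False) renders the explanation inside the notebook. Outside a notebook environment you will need another way to view the explanation.

Final Submission

    df_sub.to_csv("submission.csv", index=False)
    pd.read_csv("submission.csv")

    df_sub["class"].hist()

You want to finalize the submission: save the results to a CSV file, read the file back, and plot a histogram to check the results. The corrected code follows.

Corrected code

Save the results to a CSV file:

    df_sub.to_csv("submission.csv", index=False)

Read the CSV file:

    submission_df = pd.read_csv("submission.csv")

Plot the histogram:

    import matplotlib.pyplot as plt

    plt.figure(figsize=(10, 6))
    df_sub["class"].hist(bins=20, alpha=0.7)
    plt.title("Distribution of Predicted Classes")
    plt.xlabel("Class")
    plt.ylabel("Frequency")
    plt.grid(True)
    plt.show()

Complete code

    import pandas as pd
    import matplotlib.pyplot as plt

    # Assumes df_sub is already defined and holds the predictions:
    # df_sub["class"] = test_pred_class

    # Save the results to a CSV file
    df_sub.to_csv("submission.csv", index=False)

    # Read the CSV file back
    submission_df = pd.read_csv("submission.csv")

    # Plot the histogram
    plt.figure(figsize=(10, 6))
    submission_df["class"].hist(bins=20, alpha=0.7)
    plt.title("Distribution of Predicted Classes")
    plt.xlabel("Class")
    plt.ylabel("Frequency")
    plt.grid(True)
    plt.show()

Code explanation

Save the results to a CSV file: to_csv writes the DataFrame df_sub to submission.csv; index=False leaves out the index column.
Read the CSV file: pd.read_csv loads the saved file into submission_df.
Plot the histogram: Matplotlib draws a histogram of the class column to check the distribution of predictions. bins=20 sets the number of bins, and alpha=0.7 sets the bar transparency. plt.title, plt.xlabel, and plt.ylabel set the chart title, x-axis label, and y-axis label. plt.grid(True) adds grid lines to make the chart easier to read.

Notes

Make sure df_sub contains the correct predictions and that its class column has been set.
If the class column holds continuous predicted probabilities, convert it to discrete class labels first, for example by applying a threshold.
If the class column already holds discrete class labels, the histogram can be plotted directly.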
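The thresholding step mentioned in the notes can be sketched without any third-party library. This is only an illustration under stated assumptions: the labels "e" and "p", the probabilities, and the helper name probabilities_to_labels are all hypothetical, and the mapping assumes the label encoder assigned 0 and 1 in alphabetical order, as scikit-learn's LabelEncoder does.

```python
from collections import Counter

# Assumption: "e" (edible) was encoded as 0 and "p" (poisonous) as 1,
# i.e. the alphabetical order that LabelEncoder uses.
CLASSES = ("e", "p")


def probabilities_to_labels(probs, threshold=0.5):
    """Turn positive-class probabilities into class labels."""
    return [CLASSES[1] if p > threshold else CLASSES[0] for p in probs]


# Hypothetical predicted probabilities of the positive class
probs = [0.02, 0.97, 0.51, 0.49, 0.50]

labels = probabilities_to_labels(probs)

# Counting the labels is a quick text-only alternative to the histogram
counts = Counter(labels)

print(labels)
print(counts)
```

Because the comparison is strict, a probability of exactly 0.5 falls into class 0 in this sketch. Whether that is the right tie-breaking rule depends on the task.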

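The MCC score used as the evaluation metric in the XGBoost section can also be computed by hand from the four confusion-matrix counts. The sketch below is a dependency-free illustration, not the scikit-learn implementation; the function names and the toy labels are invented for this example.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels encoded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy, imbalanced labels (hypothetical)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_model = [1, 1, 0, 0, 0, 0, 0, 1]   # one false negative, one false positive
y_majority = [0] * len(y_true)       # always predicts the majority class

mcc_model = mcc(y_true, y_model)
mcc_majority = mcc(y_true, y_majority)
acc_majority = accuracy(y_true, y_majority)

print("MCC (model):", mcc_model)
print("MCC (majority baseline):", mcc_majority)
print("Accuracy (majority baseline):", acc_majority)
```

The majority-class baseline still reaches a respectable accuracy on this toy data while its MCC is zero, which is one reason MCC is often preferred over accuracy when the classes are imbalanced.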
