

news 2026/4/21 13:23:34

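ydata-profiling: introduction and usage

Contents
    What ydata-profiling does
    Installing ydata-profiling and basic usage
    Structure of the ydata-profiling results
    Practical use cases of ydata-profiling
        1. Dataset comparison
        2. Time-series reports
        3. Profiling large datasets
        4. Handling sensitive data
        5. Customizing the report's appearance

What ydata-profiling does

The main goal of ydata-profiling is to provide a concise and fast exploratory data analysis (EDA) experience. Just as the pandas df.describe() function is very handy, ydata-profiling delivers an extended analysis of a DataFrame and lets that analysis be exported in different formats such as html and json. The package outputs a simple, easy-to-digest analysis of a dataset, including time-series and text data.

Installing ydata-profiling and basic usage

1. Install

    pip install ydata-profiling

2. Use

    import numpy as np
    import pandas as pd
    from ydata_profiling import ProfileReport

    # A small random DataFrame with five numeric columns
    df = pd.DataFrame(np.random.rand(100, 5), columns=["a", "b", "c", "d", "e"])
    profile = ProfileReport(df, title="Profiling Report")

Structure of the ydata-profiling results

A ydata-profiling report is built around the following key features:

- Type inference: automatic detection of the columns' data types (categorical, numerical, date, etc.).
- Warnings: a summary of the problems or challenges in the data that may need work (missing data, inaccuracies, skewness, etc.).
- Univariate analysis: descriptive statistics (mean, median, mode, etc.) and informative visualizations such as distribution histograms.
- Multivariate analysis: correlations, a detailed analysis of missing data and duplicate rows, and visual support for the pairwise interaction between variables.
- Time-Series: statistics specific to time-dependent data, such as auto-correlation and seasonality, along with ACF and PACF plots.
- Text analysis: most common categories (uppercase, lowercase, separator), scripts (Latin, Cyrillic) and blocks (ASCII, Cyrillic).
- File and Image analysis: file sizes, creation dates, and indications of truncated images and of the presence of EXIF metadata.
- Compare datasets: a one-line command to quickly generate a complete comparison report of datasets.
- Flexible output formats: all analysis can be exported to an HTML report that is easy to share with different parties, to JSON for easy integration into automated systems, or used as a widget in a Jupyter Notebook.

The report also contains three additional sections:

- Overview: global details about the dataset (number of records, number of variables, overall missing values and duplicates, memory footprint).
- Alerts: a comprehensive, automatic list of potential data quality issues (high correlation, skewness, uniformity, zeros, missing values, constant values, among others).
- Reproduction: technical details about the analysis (time, version and configuration).

Practical use cases of ydata-profiling

1. Dataset comparison

ydata-profiling can be used to compare multiple versions of the same dataset. This is useful when comparing data from different time periods, such as two years. Another common scenario is viewing the data profile of the training, validation and test sets in machine learning. For example:

    import pandas as pd
    from ydata_profiling import ProfileReport

    train_df = pd.read_csv("train.csv")
    train_report = ProfileReport(train_df, title="Train")

    test_df = pd.read_csv("test.csv")
    test_report = ProfileReport(test_df, title="Test")

    comparison_report = train_report.compare(test_report)
    comparison_report.to_file("comparison.html")

The comparison report uses the title attribute from the settings as the label for each dataset. The colors are configured in settings.html.style.primary_colors. Some extra space in the report can be gained by adjusting the numeric precision parameter settings.report.precision. To compare more than two reports:

    from ydata_profiling import ProfileReport, compare

    # validation_report is assumed to be built the same way as train_report
    comparison_report = compare([train_report, validation_report, test_report])

    # Obtain merged statistics
    statistics = comparison_report.get_description()

    # Save report to file
    comparison_report.to_file("comparison.html")

Note that this feature is only guaranteed to support reports comparing two datasets. With more than two, the statistics can still be obtained, but the report may have formatting issues. One setting that can be changed is settings.report.precision. As a rule of thumb, a value of 10 works for a single report and 8 for comparing two reports.

2. Time-series reports

ydata-profiling can be used for a quick exploratory data analysis of time-series data. This is useful for quickly understanding the behavior of time-dependent variables: time plots, seasonality, trends and stationarity. Combined with the report comparison feature, you can compare how the data evolves and behaves over time, based on time-series-specific statistics such as PACF and ACF plots. Assuming the dataset contains time-dependent features, the following syntax generates a profile report:

    import pandas as pd

    from ydata_profiling.utils.cache import cache_file
    from ydata_profiling import ProfileReport

    file_name = cache_file(
        "pollution_us_2000_2016.csv",
        "https://query.data.world/s/mz5ot3l4zrgvldncfgxu34nda45kvb",
    )

    df = pd.read_csv(file_name, index_col=[0])

    # Filtering time-series to profile a single site
    site = df[df["Site Num"] == 3003]

    # Profile the filtered frame (the original snippet passed the unfiltered df)
    profile = ProfileReport(site, tsmode=True, sortby="Date Local", title="Time-Series EDA")

    profile.to_file("report_timeseries.html")

To generate a time-series report, tsmode must be set to True. When it is, variables with temporal dependence are identified automatically based on the presence of autocorrelation. The time-series report uses the sortby attribute to order the dataset. If it is not provided, the dataset is assumed to be already ordered. In some cases you may already know which variables should be time series, or you may simply want to make sure that the variables you want analyzed as time series are profiled as such:

    import pandas as pd

    from ydata_profiling.utils.cache import cache_file
    from ydata_profiling import ProfileReport

    file_name = cache_file(
        "pollution_us_2000_2016.csv",
        "https://query.data.world/s/mz5ot3l4zrgvldncfgxu34nda45kvb",
    )

    df = pd.read_csv(file_name, index_col=[0])

    # Filtering time-series to profile a single site
    site = df[df["Site Num"] == 3003]

    # Setting what variables are time series
    type_schema = {
        "NO2 Mean": "timeseries",
        "NO2 1st Max Value": "timeseries",
        "NO2 1st Max Hour": "timeseries",
        "NO2 AQI": "timeseries",
        "cos": "numeric",
        "cat": "numeric",
    }

    profile = ProfileReport(
        site,
        tsmode=True,
        type_schema=type_schema,
        sortby="Date Local",
        title="Time-Series EDA for site 3003",
    )

    profile.to_file("report_timeseries.html")

3. Profiling large datasets

By default, ydata-profiling summarizes the input dataset comprehensively, in the way that gives the most insight for data analysis. For small datasets these computations run in near real time. For larger datasets you may have to decide beforehand which computations to run. Whether a computation scales to a large dataset depends not only on the exact size of the dataset, but also on its complexity and on whether a fast computation is available. If profiling time becomes a bottleneck, ydata-profiling offers several ways around it.

3.1 Minimal mode

ydata-profiling ships with a minimal configuration file in which the most expensive computations are turned off by default. This is the recommended starting point for larger datasets.

    # large_dataset is assumed to be an existing pandas DataFrame
    profile = ProfileReport(large_dataset, minimal=True)
    profile.to_file("output.html")

3.2 Sampling the dataset

Another way to handle very large datasets is to generate the profile report from a portion of the data. Some users report that this is a good way to cut computation time while keeping the sample representative.

    sample = large_dataset.sample(10000)

    profile = ProfileReport(sample, minimal=True)
    profile.to_file("output.html")

Readers of the report may want to know that the profile was generated from a sample of the data. This can be stated by adding a description to the report.

    import ydata_profiling  # registers DataFrame.profile_report

    description = "Disclaimer: this profiling report was generated using a sample of 5% of the original dataset."
    sample = large_dataset.sample(frac=0.05)

    profile = sample.profile_report(dataset={"description": description}, minimal=True)
    profile.to_file("output.html")

3.3 Disabling expensive computations

To reduce the computational burden on particularly large datasets while still keeping some of the information of interest, certain computations can be restricted to selected columns. In particular, a list of targets can be given to Interactions so that only the interactions involving those variables are computed.

    from ydata_profiling import ProfileReport
    import pandas as pd

    # Reading the data
    data = pd.read_csv(
        "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
    )

    # Creating the profile without specifying the data source, to allow editing the configuration
    profile = ProfileReport()
    profile.config.interactions.targets = ["Name", "Sex", "Age"]

    # Assigning a DataFrame and exporting to a file, triggering computation
    profile.df = data
    profile.to_file("report.html")

The setting that controls this, interactions.targets, can be changed through several interfaces (configuration files or environment variables).

3.4 Concurrency

ydata-profiling is a project under active development. One of the most requested features is a scalable backend such as Modin or Dask.

4. Handling sensitive data

In some data-sensitive contexts (private health records, for example), sharing a report that contains samples could violate privacy constraints. The following configuration shorthand groups together several options so that the report provides only aggregate information and shows no individual records:

    report = df.profile_report(sensitive=True)

In addition, ydata-profiling does not send data to external services, which makes it well suited to private data.

4.1 Samples and duplicates

Displaying dataset samples and duplicate rows can be disabled, to make sure the report does not leak any data directly:

    report = df.profile_report(duplicates=None, samples=None)

Alternatively, a sample can still be shown. The following snippet generates a report that uses mock or synthetic data in the dataset sample section. Note that the name and caption keys are optional.

    # Replace with the sample you'd like to present in the report (can be from a mock or synthetic data generator)
    sample_custom_data = pd.DataFrame()
    sample_description = "Disclaimer: the following sample consists of synthetic data following the format of the underlying dataset."

    report = df.profile_report(
        sample={
            "name": "Mock data sample",
            "data": sample_custom_data,
            "caption": sample_description,
        }
    )

4.2 Dataset metadata, data dictionary and configuration

When sharing a report with colleagues or publishing it online, it can be important to include metadata about the dataset, such as the author, copyright holder or a description. ydata-profiling lets you complement a report with that information. Inspired by schema.org's Dataset, the currently supported properties are description, creator, author, url, copyright_year and copyright_holder. The following example generates a report with a description, copyright holder, copyright year and URL. In the generated report, these properties appear in the Overview section, under "About".

    from pathlib import Path

    report = df.profile_report(
        title="Masked data",
        dataset={
            "description": "This profiling report was generated using a sample of 5% of the original dataset.",
            "copyright_holder": "StataCorp LLC",
            "copyright_year": "2020",
            "url": "http://www.stata-press.com/data/r15/auto2.dta",
        },
    )

    report.to_file(Path("stata_auto_report.html"))

Beyond dataset-level details, users often want to include a description of each column when sharing reports with team members and stakeholders. ydata-profiling supports these descriptions, so the report carries a built-in data dictionary. By default, the descriptions are shown in the Overview section of the report, next to each variable.

    profile = df.profile_report(
        variables={
            "descriptions": {
                "files": "Files in the filesystem",  # variable name: variable description
                "datec": "Creation date",
                "datem": "Modification date",
            }
        }
    )

    profile.to_file("report.html")

Column descriptions can also be loaded from a JSON file:

    {
        "column name 1": "column 1 definition",
        "column name 2": "column 2 definition"
    }

    import json
    import pandas as pd
    import ydata_profiling  # registers DataFrame.profile_report

    definition_file = "dataset_column_definition.json"

    # Read the variable descriptions
    with open(definition_file, "r") as f:
        definitions = json.load(f)

    # By default, the descriptions are presented in the Overview section, next to each variable
    # (the keyword is variables, as in the previous example)
    report = df.profile_report(variables={"descriptions": definitions})

    # We can disable showing the descriptions next to each variable
    report = df.profile_report(
        variables={"descriptions": definitions}, show_variable_description=False
    )

    report.to_file("report.html")

In addition to dataset details, users often want to set the type schema as well. This matters most when integrating ydata-profiling with information already held in a data catalog. With ProfileReport, the type_schema property controls the data types used for profiling. By default, the types are inferred automatically by visions.

    import pandas as pd

    from ydata_profiling import ProfileReport
    from ydata_profiling.utils.cache import cache_file

    file_name = cache_file(
        "titanic.csv",
        "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv",
    )
    df = pd.read_csv(file_name)

    type_schema = {"Survived": "categorical", "Embarked": "categorical"}

    # We can set the type_schema only for the variables whose types we are certain of.
    # All the others will be inferred automatically.
    report = ProfileReport(df, title="Titanic EDA", type_schema=type_schema)

    report.to_file("report.html")

5. Customizing the report's appearance

In some situations users may want to customize the appearance of the report to suit personal preference or a corporate brand. ydata-profiling exposes two main areas of customization: the styling of the HTML report, and the styling of the visualizations and plots inside it.

5.1 Customizing the report's theme

Several aspects of the report can be customized. The table below lists the available settings:

| Parameter | Type | Default | Description |
|---|---|---|---|
| html.minify_html | bool | True | If True, the output HTML is minified using the htmlmin package. |
| html.use_local_assets | bool | True | If True, all assets (stylesheets, scripts, images) are stored locally. If False, a CDN is used for some stylesheets and scripts. |
| html.inline | boolean | True | If True, all assets are contained in the report. If False, a web export is created in which all assets are stored in the "[REPORT_NAME]_assets/" directory. |
| html.navbar_show | boolean | True | Whether to include a navigation bar in the report. |
| html.style.theme | string | None | Select a Bootswatch theme. Available options: flatly (dark) and united (orange). |
| html.style.logo | string | | A base64-encoded logo, displayed in the navigation bar. |
| html.style.primary_color | string | #337ab7 | The primary color used in the report. |
| html.style.full_width | boolean | False | By default the report has a fixed width. If True, the full width of the screen is used. |

One way to pass arguments to the underlying matplotlib visualization engine is the plot argument, given when the profile is computed. The image format can be set to png with the key-value pair image_format: "png", and the image resolution can be changed with a pair such as dpi: 800. For example:

    # planets is assumed to be an existing pandas DataFrame
    profile = ProfileReport(
        planets,
        title="Pandas Profiling Report",
        explorative=True,
        plot={"dpi": 200, "image_format": "png"},
    )

Pie charts are used to plot the frequency of the categories in categorical or boolean features. By default, a feature is considered categorical if it has no more than 10 distinct values. This threshold can be configured with the plot.pie.max_unique setting. If a feature is not considered categorical, no pie chart is displayed for it, so all pie charts can be removed by setting plot.pie.max_unique = 0. The colors of the pie charts can be set to any recognized matplotlib color through the plot.pie.colors setting.

    profile = ProfileReport(pd.DataFrame([1, 2, 3]))
    profile.config.plot.pie.colors = ["gold", "b", "#FF796C"]

The color palettes used in visualizations such as the correlation matrix and the missing values overview can also be customized through the plot argument. To customize the palette used by the correlation matrix, use the correlation key:

    from ydata_profiling import ProfileReport

    profile = ProfileReport(
        df,
        title="Pandas Profiling Report",
        explorative=True,
        plot={"correlation": {"cmap": "RdBu_r", "bad": "#000000"}},
    )

Similarly, the palette for missing values can be changed with the missing key:

    from ydata_profiling import ProfileReport

    profile = ProfileReport(
        df,
        title="Pandas Profiling Report",
        explorative=True,
        plot={"missing": {"cmap": "RdBu_r"}},
    )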
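The plot examples in this section each set one fragment of the plot argument. When several of them are wanted at once, they can be combined into a single nested dictionary first. The sketch below shows one way to do that. Note that merge_settings and build_report are hypothetical helpers written for this illustration, not part of ydata-profiling, and that passing the pie threshold as plot={"pie": {"max_unique": 0}} is an assumption based on the plot.pie.max_unique setting path.

```python
from copy import deepcopy


def merge_settings(base: dict, override: dict) -> dict:
    """Recursively merge two nested settings dicts without mutating either.

    Hypothetical helper for illustration; not a ydata-profiling function.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge them key by key
            merged[key] = merge_settings(merged[key], value)
        else:
            # Otherwise the override wins
            merged[key] = deepcopy(value)
    return merged


# Each fragment mirrors one of the plot settings discussed in this section.
resolution = {"dpi": 200, "image_format": "png"}
no_pie_charts = {"pie": {"max_unique": 0}}  # assumed dict form of plot.pie.max_unique
correlation_palette = {"correlation": {"cmap": "RdBu_r", "bad": "#000000"}}
missing_palette = {"missing": {"cmap": "RdBu_r"}}

plot_settings: dict = {}
for fragment in (resolution, no_pie_charts, correlation_palette, missing_palette):
    plot_settings = merge_settings(plot_settings, fragment)


def build_report(df, title="Profiling Report"):
    """Pass the merged settings to ProfileReport (requires ydata-profiling)."""
    from ydata_profiling import ProfileReport  # imported lazily on purpose

    return ProfileReport(df, title=title, explorative=True, plot=plot_settings)
```

Calling build_report(df).to_file("report.html") would then produce a report with all four customizations applied, assuming the library accepts the combined dictionary in the same way it accepts each fragment on its own.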

