如何ps做网站首页,受欢迎的昆明网站推广,台州网站建设优化案例,网站建设的具体方法Applied Spatial Statistics#xff08;七#xff09;#xff1a;Python 中的空间回归
本笔记本演示了如何使用 pysal 的 spreg 库拟合空间滞后模型和空间误差模型。
OLS空间误差模型空间滞后模型三种模型的比较探索滞后模型中的直接和间接影响
import numpy as np
impor…Applied Spatial Statistics七Python 中的空间回归
本笔记本演示了如何使用 pysal 的 spreg 库拟合空间滞后模型和空间误差模型。
OLS空间误差模型空间滞后模型三种模型的比较探索滞后模型中的直接和间接影响
import numpy as np
import pandas as pdimport geopandas as gpd
import seaborn as sns
import matplotlib.pyplot as pltfrom libpysal.weights import Queen
from splot.esda import plot_moran
from esda.moran import Moran
import spreg1.数据
在此笔记本中我将使用 2020 年美国总统选举数据集进行演示。 voting 数据框包含县级投票给民主党的人数百分比编码为 new_pct_dem以及该县的一些社会经济变量。该数据集仅包含美国本土 48 个州的统计数据。
voting pd.read_csv(https://raw.github.com/Ziqi-Li/gis5122/master/data/voting_2020.csv)voting[[median_income]] voting[[median_income]]/10000voting.head()county_idstatecountyNAMEproj_Xproj_Ytotal_popnew_pct_demsex_ratiopct_black...median_incomepct_65_overpct_age_18_29ginipct_manufln_pop_denpct_3rd_partyturn_outpct_fbpct_uninsured0170511751Fayette County, Illinois597979.55311796861.9932156518.445122113.64.7...4.665018.814.8991420.437314.93.3927151.92365258.9309841.38.211710717107Logan County, Illinois559814.67661920479.9752900329.42003097.26.9...5.730818.017.2568360.420112.43.8472242.33285056.6315521.64.521716517165Saline County, Illinois650278.35791660709.8082399425.60191196.92.6...4.409019.913.5867300.46928.74.1286541.77813959.1479371.04.23170971797Lake County, Illinois654010.92622174576.60570147362.27588899.86.8...8.942713.715.8231320.484716.37.3082011.95417771.15197518.76.841712717127Massac County, Illinois640398.98631599902.4911421925.62611889.55.8...4.748120.812.3707720.40977.44.0677881.39644362.2814251.05.4
5 rows × 22 columns
然后我们阅读了美国的县边界文件。
shp gpd.read_file(https://raw.github.com/Ziqi-Li/gis5122/master/data/us_counties.geojson)#Merge the shapefile with the voting data by the common county_id
shp_voting shp.merge(voting, on county_id)#Dissolve the counties to obtain boundary of states, used for mapping
state shp_voting.dissolve(bySTATEFP).geometry.boundary选择本练习中要使用的变量我从列表中选择了 6 个预测因子。
variable_names [sex_ratio, pct_black, pct_hisp,pct_bach, median_income,ln_pop_den]y shp_voting[[new_pct_dem]].valuesX shp_voting[variable_names].values2.OLS model (baseline)
这里我演示如何使用 spring 来拟合 OLS 模型。当然你也可以使用 statsmodels。
#In the spreg.OLS() you need to specify the y and X, also variable names (optional)ols spreg.OLS(y, X, name_ynew_pct_dem, name_xvariable_names)print(ols.summary)REGRESSION RESULTS
------------------SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set : unknown
Weights matrix : None
Dependent Variable : new_pct_dem Number of Observations: 3103
Mean dependent var : 33.7616 Number of Variables : 7
S.D. dependent var : 16.2257 Degrees of Freedom : 3096
R-squared : 0.6091
Adjusted R-squared : 0.6083
Sum squared residual: 319249 F-statistic : 803.9833
Sigma-square : 103.117 Prob(F-statistic) : 0
S.E. of regression : 10.155 Log likelihood : -11592.001
Sigma-square ML : 102.884 Akaike info criterion : 23198.003
S.E of regression ML: 10.1432 Schwarz criterion : 23240.284------------------------------------------------------------------------------------Variable Coefficient Std.Error t-Statistic Probability
------------------------------------------------------------------------------------CONSTANT 5.83676 1.96534 2.96984 0.00300sex_ratio 0.00613 0.01754 0.34970 0.72659pct_black 0.48310 0.01401 34.47681 0.00000pct_hisp 0.23952 0.01329 18.02612 0.00000pct_bach 0.97537 0.02854 34.17057 0.00000median_income -1.66008 0.19755 -8.40329 0.00000ln_pop_den 2.13283 0.12925 16.50160 0.00000
------------------------------------------------------------------------------------REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER 33.073TEST ON NORMALITY OF ERRORS
TEST DF VALUE PROB
Jarque-Bera 2 1166.636 0.0000DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST DF VALUE PROB
Breusch-Pagan test 6 268.617 0.0000
Koenker-Bassett test 6 118.745 0.0000END OF REPORT 我们可以与 statsmodels 进行比较结果相同
查看 OLS 残差图
ols.uarray([[ 0.98102642],[-7.61122909],[ 3.44911401],...,[-3.82424613],[-0.7422498 ],[-1.11507546]])from matplotlib import colors#For creating a discrete color classification
norm colors.BoundaryNorm([-20, -10, -5, 0, 5, 10, 20],ncolors256)ax shp_voting.plot(columnols.u.reshape(-1),legendTrue,figsize(15,8), normnorm, linewidth0.0)state.plot(axax,linewidth0.3,edgecolorblack)plt.title(Map of residuals of the OLS model,fontsize15)Text(0.5, 1.0, Map of residuals of the OLS model)从 OLS 残差图中我们可以看到空间自相关性很强高/低残差聚集在一起。这强烈表明我们的模型缺少空间结构并且违反了 OLS 的独立性假设。
然后让我们通过计算残差的 Moran’s I 来更定量地评估空间自相关性。
#Here we use the Queen contiguity
w Queen.from_dataframe(shp_voting)#row standardization
w.transform R#The warning is saying there are two counties without neighbors, lets dont worry about this for now.ipython-input-14-9c7ca81e50b6:2: FutureWarning: use_index defaults to False but will default to True in future. Set True/False directly to control this behavior and silence this warningw Queen.from_dataframe(shp_voting)(WARNING: , 2441, is an island (no neighbors))
(WARNING: , 2701, is an island (no neighbors))/usr/local/lib/python3.10/dist-packages/libpysal/weights/weights.py:224: UserWarning: The weights matrix is not fully connected: There are 3 disconnected components.There are 2 islands with ids: 2441, 2701.warnings.warn(message)#Here, lets calculate the Morans I value, and plot it.
#ols.u is the residuals from the OLS modelols_moran Moran(ols.u, w, permutations 199) #199 permutationsplot_moran(ols_moran, figsize(10,4))我们发现 Moran’s I 值等于 0.6这让我们确信 OLS 残差图上确实存在很强的空间模式。
现在有两个选择我们可以使用 滞后模型或者我们可以使用 误差模型。
一个便利之处在于如果您将权重矩阵传递给 OLS 函数同时指定 spat_diagTrue那么您将获得一些额外的空间诊断可以帮助您做出决定。如果您指定 moranTrue这还包括残差的 Moran’s I。
ols spreg.OLS(y, X, ww, spat_diagTrue, moranTrue,name_ypct_dem, name_xvariable_names)print(ols.summary)REGRESSION RESULTS
------------------SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set : unknown
Weights matrix : unknown
Dependent Variable : pct_dem Number of Observations: 3103
Mean dependent var : 33.7616 Number of Variables : 7
S.D. dependent var : 16.2257 Degrees of Freedom : 3096
R-squared : 0.6091
Adjusted R-squared : 0.6083
Sum squared residual: 319249 F-statistic : 803.9833
Sigma-square : 103.117 Prob(F-statistic) : 0
S.E. of regression : 10.155 Log likelihood : -11592.001
Sigma-square ML : 102.884 Akaike info criterion : 23198.003
S.E of regression ML: 10.1432 Schwarz criterion : 23240.284------------------------------------------------------------------------------------Variable Coefficient Std.Error t-Statistic Probability
------------------------------------------------------------------------------------CONSTANT 5.83676 1.96534 2.96984 0.00300sex_ratio 0.00613 0.01754 0.34970 0.72659pct_black 0.48310 0.01401 34.47681 0.00000pct_hisp 0.23952 0.01329 18.02612 0.00000pct_bach 0.97537 0.02854 34.17057 0.00000median_income -1.66008 0.19755 -8.40329 0.00000ln_pop_den 2.13283 0.12925 16.50160 0.00000
------------------------------------------------------------------------------------REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER 33.073TEST ON NORMALITY OF ERRORS
TEST DF VALUE PROB
Jarque-Bera 2 1166.636 0.0000DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST DF VALUE PROB
Breusch-Pagan test 6 268.617 0.0000
Koenker-Bassett test 6 118.745 0.0000DIAGNOSTICS FOR SPATIAL DEPENDENCE
TEST MI/DF VALUE PROB
Morans I (error) 0.6000 56.164 0.0000
Lagrange Multiplier (lag) 1 1952.063 0.0000
Robust LM (lag) 1 18.792 0.0000
Lagrange Multiplier (error) 1 3119.345 0.0000
Robust LM (error) 1 1186.074 0.0000
Lagrange Multiplier (SARMA) 2 3138.137 0.0000 END OF REPORT 从 L-M 检验中我们可以预期误差模型比滞后模型更为合适比较稳健得分1186 比 18。 3.Spatial Error Model (SEM)
现在让我们使用“spreg.ML_Error()”拟合空间错误模型其中您需要指定 y、X 和权重矩阵 w。
sem spreg.ML_Error(y, X, ww, name_xvariable_names, name_ynew_pct_dem)print(sem.summary)/usr/local/lib/python3.10/dist-packages/scipy/optimize/_minimize.py:913: RuntimeWarning: Method bounded does not support relative tolerance in x; defaulting to absolute tolerance.warn(Method bounded does not support relative tolerance in x; REGRESSION RESULTS
------------------SUMMARY OF OUTPUT: ML SPATIAL ERROR (METHOD full)
---------------------------------------------------
Data set : unknown
Weights matrix : unknown
Dependent Variable : new_pct_dem Number of Observations: 3103
Mean dependent var : 33.7616 Number of Variables : 7
S.D. dependent var : 16.2257 Degrees of Freedom : 3096
Pseudo R-squared : 0.5167
Log likelihood : -10170.5905
Sigma-square ML : 33.4938 Akaike info criterion : 20355.181
S.E of regression : 5.7874 Schwarz criterion : 20397.462------------------------------------------------------------------------------------Variable Coefficient Std.Error z-Statistic Probability
------------------------------------------------------------------------------------CONSTANT 19.00473 1.49728 12.69280 0.00000sex_ratio -0.04056 0.00988 -4.10353 0.00004pct_black 0.74856 0.01685 44.43261 0.00000pct_hisp 0.31866 0.01734 18.37383 0.00000pct_bach 0.72854 0.02000 36.42952 0.00000median_income -2.42279 0.15714 -15.41832 0.00000ln_pop_den 1.83430 0.13375 13.71399 0.00000lambda 0.86984 0.00994 87.51130 0.00000
------------------------------------------------------------------------------------END OF REPORT 空间滞后误差项的 lambda或其他使用 rho 的软件或符号系数非常显著并且其幅度相当大这表明残差中存在很强的空间自相关性这被滞后误差项捕获。
请注意sem 的 sem.e_filtered 属性应该是 iid 误差。而 sem.u 是自回归误差 iid 误差。现在让我们再次查看残差的 Moran’s I。
sem.e_filteredarray([[-2.82249844],[-2.93425648],[ 1.76293602],...,[ 0.58894001],[ 4.25853052],[-4.82251595]])sem_moran Moran(sem.e_filtered, w, permutations 199) #199 permutations
plot_moran(sem_moran, zstandardTrue, figsize(10,4))非常低的 Moran’s I - 随机
ax shp_voting.plot(columnsem.e_filtered.reshape(-1),legendTrue,figsize(15,8), normnorm, linewidth0.0)
state.plot(axax,linewidth0.3,edgecolorblack)
plt.title(Map of filtered residuals of the SEM model,fontsize15)
Text(0.5, 1.0, Map of filtered residuals of the SEM model)随机模式太棒了 4.Spatial Lag Model
类似地让我们将此重复到空间滞后模型
slm spreg.ML_Lag(y, X, ww, name_ynew_pct_dem, name_xvariable_names)print(slm.summary)REGRESSION RESULTS
------------------SUMMARY OF OUTPUT: MAXIMUM LIKELIHOOD SPATIAL LAG (METHOD FULL)
-----------------------------------------------------------------
Data set : unknown
Weights matrix : unknown
Dependent Variable : new_pct_dem Number of Observations: 3103
Mean dependent var : 33.7616 Number of Variables : 8
S.D. dependent var : 16.2257 Degrees of Freedom : 3095
Pseudo R-squared : 0.7744
Spatial Pseudo R-squared: 0.5591
Log likelihood : -10853.5647
Sigma-square ML : 59.4991 Akaike info criterion : 21723.129
S.E of regression : 7.7136 Schwarz criterion : 21771.450------------------------------------------------------------------------------------Variable Coefficient Std.Error z-Statistic Probability
------------------------------------------------------------------------------------CONSTANT 1.07097 1.50321 0.71245 0.47618sex_ratio 0.01044 0.01333 0.78274 0.43378pct_black 0.27765 0.01295 21.43381 0.00000pct_hisp 0.16619 0.01060 15.68552 0.00000pct_bach 0.83467 0.02200 37.93947 0.00000median_income -2.99112 0.15043 -19.88429 0.00000ln_pop_den 1.47061 0.10082 14.58655 0.00000W_new_pct_dem 0.58112 0.01342 43.31362 0.00000
------------------------------------------------------------------------------------END OF REPORT 空间滞后项“W_new_pct_dem”的 rho 系数显著且幅度很大这表明因变量具有很强的空间溢出效应。
slm_moran Moran(slm.u, w, permutations 199) #199 permutations
plot_moran(slm_moran, zstandardTrue, figsize(10,4))ax shp_voting.plot(columnslm.u.reshape(-1),legendTrue,figsize(15,8), normnorm, linewidth0.0)state.plot(axax,linewidth0.3,edgecolorblack)
plt.title(Map of residuals of the spatial lag model,fontsize15)
Text(0.5, 1.0, Map of residuals of the spatial lag model)5.滞后、误差和 OLS 模型的交叉比较。
总体而言我们看到尽管有一些变化例如在 SEM 模型中%black 的影响更大但估计值是一致的。OLS 模型不可靠因为我们知道假设被违反了。在滞后模型中即使我们考虑了邻近投票偏好残差仍然显示出一些弱自相关性。而在误差模型中我们确实观察到了随机残差。
所以如果我需要做出决定我会使用误差模型。这也得到了 LM 测试的证据以及误差模型具有最低 AIC 值的支持。
PredictorOLS EstimatesSLM EstimatesSEM EstimatesCONSTANT5.83*1.0719.00*sex_ratio0.000.01-0.04*pct_black0.48*0.27*0.74*pct_hisp0.23*0.16*0.31*pct_bach0.97*0.83*0.72*median_income-1.66*-2.99*-2.42*ln_pop_den2.13*1.47*1.83*lambdaNA0.58*0.86*AIC23198.0021723.1220355.18Moran’s I of residuals0.600.15-0.08
6.更多关于 SLM 模型的内容间接影响的检查。
场景如果莱昂的 bach 百分比增加 1%会怎样附近县的 dem 百分比会发生什么变化
步骤
使用 w.full() 获取完整的 n x n 矩阵计算 (I-pW)^-1*beta此处的估计值是 SLM 模型中的 bach 百分比因此为 0.83现在您获得了完整的 n x n 变化交互。找到任何感兴趣的县的行索引。现在您可以选择该县的列并检查这将如何影响其他县。
#1.
w.full()[0]array([[0., 0., 0., ..., 0., 0., 0.],[0., 0., 0., ..., 0., 0., 0.],[0., 0., 0., ..., 0., 0., 0.],...,[0., 0., 0., ..., 0., 0., 0.],[0., 0., 0., ..., 0., 0., 0.],[0., 0., 0., ..., 0., 0., 0.]])np.identity(3103)array([[1., 0., 0., ..., 0., 0., 0.],[0., 1., 0., ..., 0., 0., 0.],[0., 0., 1., ..., 0., 0., 0.],...,[0., 0., 0., ..., 1., 0., 0.],[0., 0., 0., ..., 0., 1., 0.],[0., 0., 0., ..., 0., 0., 1.]])#2.
effects np.linalg.inv(np.identity(3103) - 0.58*w.full()[0])*0.83 #n3103, rho0.58, est_pct_bach 0.83effectsarray([[8.92141250e-01, 1.32877317e-03, 1.52996497e-14, ...,3.14050249e-31, 5.12287114e-06, 1.79564935e-10],[9.49123691e-04, 8.98379566e-01, 1.43986921e-13, ...,4.16193409e-29, 1.14493906e-03, 1.32924287e-10],[1.27497081e-14, 1.67984741e-13, 9.03375896e-01, ...,8.63302503e-23, 5.33348558e-13, 2.63149009e-21],...,[2.61708541e-31, 4.85558978e-29, 8.63302503e-23, ...,8.88070682e-01, 8.55288372e-27, 1.32652812e-25],[3.65919367e-06, 1.14493906e-03, 4.57155907e-13, ...,7.33104319e-27, 8.95542764e-01, 4.10759600e-10],[1.12228084e-10, 1.16308751e-10, 1.97361756e-21, ...,9.94896087e-26, 3.59414650e-10, 9.05420698e-01]])#3. find the row index for Leon, which is 67.
shp_voting[shp_voting[NAME_x] Leon]GEOIDSTATEFPNAME_xcounty_idgeometrystatecountyNAME_yproj_Xproj_Y...median_incomepct_65_overpct_age_18_29ginipct_manufln_pop_denpct_3rd_partyturn_outpct_fbpct_uninsured671207312Leon12073POLYGON ((1080730.82089 870592.41110, 1086551....1273Leon County, Florida1.120705e06889449.6751...5.310612.930.0407220.48962.06.0235681.20236471.9104746.88.110504828948Leon48289POLYGON ((-30255.42040 920246.79226, -30598.62...48289Leon County, Texas3.831304e02913423.2253...4.304524.212.3425250.52715.72.7675770.89944668.0744175.217.0
2 rows × 26 columns #Total effects for Leon can be obtained in the diagnoal of the full effects matrix.effects[67,67]0.903899945968712这表明大学毕业生人数每增加 1%民主党的投票份额将增加约 0.90%直接 间接。请注意这大于滞后模型的系数即 0.83该系数仅捕捉直接影响。 间接影响是其自身与邻居之间的空间相互作用的结果约为 0.07%0.90 - 0.83
现在让我们来看看莱昂的变化如何影响周边县市。
#get the effects for leon and plot it
shp_voting[d_pct_bach_leon] effects[:,67]ax shp_voting[shp_voting[state] 12].plot(columnd_pct_bach_leon,legendTrue,figsize(15,8), linewidth0.0,aspect1)plt.title(% increase in Dem share if Leon has \n1% more college graduates,fontsize15,y1.08)Text(0.5, 1.08, % increase in Dem share if Leon has \n1% more college graduates)shp_voting[(shp_voting[state] 12) (shp_voting[NAME_x] Jefferson)].d_pct_bach_leon2808 0.121902
Name: d_pct_bach_leon, dtype: float64因此我们基本上可以看出如果莱昂的大学毕业生人数增加 1%预计附近县的 %dem 份额将增加约 0.12%。例如受莱昂变化的影响杰斐逊县的 dem 份额可能会增加 0.12%。
shp_voting[(shp_voting[state] 12) (shp_voting[NAME_x] Miami-Dade)].d_pct_bach_leon2607 1.163942e-08
Name: d_pct_bach_leon, dtype: float64然而我们可以看到对于迈阿密戴德等较远的县间接溢出效应基本为零。这是因为效应的幅度 (rho) 以及指定的 W 矩阵非常局部。