

news 2026/4/22 4:19:42

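1 BatchNorm

How BN works

BN (batch normalization) is the most widely used normalization method in computer vision. It computes the mean and variance of the input feature map over the N, H, and W dimensions, and then uses them to normalize the feature map. The computation (shown as a figure in the original post) proceeds as follows:

1) For each channel, compute the mean over the other dimensions (N, H, W).
2) For each channel, compute the variance over the other dimensions.
3) Normalize the feature map.
4) Apply an affine transform (scale and shift) to the normalized feature map using the learnable parameters γ and β, which are updated after every backward pass.

PyTorch provides three BN modules: torch.nn.BatchNorm1d, torch.nn.BatchNorm2d, and torch.nn.BatchNorm3d. Taking torch.nn.BatchNorm2d as an example, its arguments are:

Args:
    num_features: number of input feature channels
    eps: value added to the denominator for numerical stability (so the denominator cannot be zero or approach zero); default 1e-5
    momentum: momentum (exponential averaging factor) used when computing running_mean and running_var; default 0.1
    affine: bool, whether the BN layer has the learnable affine parameters γ and β; default True
    track_running_stats: bool, whether to track the running mean and variance during training. If False, the BN layer uses only the statistics of the current input in both training and validation. With a small batch size those statistics can deviate strongly from the global statistics and give poor results. Default True

The formula for updating running_mean and running_var is

$$\hat{x}_{\text{new}} = (1 - \text{momentum}) \times \hat{x} + \text{momentum} \times x_t$$

where \hat{x} is the running_mean or running_var before the update and x_t is the mean or variance of the current input. In validation mode (model.eval()), running_mean and running_var are used as the mean and variance to normalize the input tensor.

Advantages of BN

- BN keeps the distribution of each layer's input relatively stable, so a larger learning rate can be used; this speeds up training and convergence considerably.
- BN makes the model less sensitive to its parameters, reduces the strong dependence on initialization, simplifies hyperparameter tuning, and makes learning more stable.
- BN allows saturating activation functions (such as sigmoid). The normalized data keeps gradients at reasonably large values, which alleviates vanishing or exploding gradients.
- BN has a mild regularization effect: it adds noise to the hidden layers, similar to Dropout, and can reduce overfitting.

Disadvantages of BN

- It is sensitive to the batch size. If the batch size is too small, the computed mean and variance do not represent the whole data distribution.
- A small batch size introduces more randomness, which makes convergence harder.
- It is not suitable for RNNs, style transfer, and similar tasks. In style transfer, for example, a mini-batch may contain several unrelated images, and computing the mean and variance over all of them weakens details that are specific to each individual image.

Code example

1) Randomly initialize the input tensor and instantiate BN

```python
import torch
import torch.nn as nn

# Fix the random seed so the generated input is the same on every run
torch.manual_seed(42)
# Randomly generate an input of shape [1, 2, 2, 2]
input = torch.randn((1, 2, 2, 2)).cuda()
print("input:", input)

# Instantiate BN
bn = nn.BatchNorm2d(num_features=2, eps=0.00001, momentum=0.1, affine=True, track_running_stats=True).cuda()
bn.running_mean = (torch.ones([2]) * 2).cuda()
bn.running_var = (torch.ones([2]) * 1).cuda()
bn.train()
# Inspect the state before the forward pass
print("training:", bn.training)
print("running_mean:", bn.running_mean)
print("running_var:", bn.running_var)
print("weight:", bn.weight)  # γ, initialized to 1
print("bias:", bn.bias)      # β, initialized to 0
```

Printed result:

```
input: tensor([[[[ 0.3367,  0.1288],
          [ 0.2345,  0.2303]],

         [[-1.1229, -0.1863],
          [ 2.2082, -0.6380]]]], device='cuda:0')
training: True
running_mean: tensor([2., 2.], device='cuda:0')
running_var: tensor([1., 1.], device='cuda:0')
weight: Parameter containing:
tensor([1., 1.], device='cuda:0', requires_grad=True)
bias: Parameter containing:
tensor([0., 0.], device='cuda:0', requires_grad=True)
```

2) Pass the input through the BN layer and get the output

```python
# Output
output = bn(input)
print("output:", output)

# Inspect the state after the forward pass
print("training:", bn.training)
print("running_mean:", bn.running_mean)
print("running_var:", bn.running_var)
print("weight:", bn.weight)
print("bias:", bn.bias)
```

Printed result. Because no backward pass has run, γ and β are unchanged:

```
output: tensor([[[[ 1.4150, -1.4102],
          [ 0.0257, -0.0305]],

         [[-0.9276, -0.1964],
          [ 1.6731, -0.5491]]]], device='cuda:0',
       grad_fn=<CudnnBatchNormBackward0>)
training: True
running_mean: tensor([1.8233, 1.8065], device='cuda:0')
running_var: tensor([0.9007, 1.1187], device='cuda:0')
weight: Parameter containing:
tensor([1., 1.], device='cuda:0', requires_grad=True)
bias: Parameter containing:
tensor([0., 0.], device='cuda:0', requires_grad=True)
```

3) Write the normalization by hand, following the principle of BN

```python
# Work on a CPU copy so the printed tensors carry no device suffix
x = input.detach().cpu()

# Mean and variance of the current input, per channel.
# torch.var() defaults to unbiased=True. BN normalizes with the biased
# variance (unbiased=False) but updates running_var with the unbiased one,
# so both are needed.
cur_mean = torch.mean(x, dim=[0, 2, 3])
cur_var_biased = torch.var(x, dim=[0, 2, 3], unbiased=False)  # used to normalize
cur_var = torch.var(x, dim=[0, 2, 3], unbiased=True)          # used for running_var
print("cur_mean:", cur_mean)
print("cur_var:", cur_var)

# Compute running_mean and running_var
new_mean = (torch.ones([2]) * 2) * (1 - bn.momentum) + cur_mean * bn.momentum
new_var = (torch.ones([2]) * 1) * (1 - bn.momentum) + cur_var * bn.momentum
print("new_mean:", new_mean)
print("new_var:", new_var)
```

Printed result. The computed new_mean and new_var match the running_mean and running_var from step 2:

```
cur_mean: tensor([0.2326, 0.0653])
cur_var: tensor([0.0072, 2.1872])
new_mean: tensor([1.8233, 1.8065])
new_var: tensor([0.9007, 1.1187])
```

```python
# Compute the output. Training normalizes with the statistics of the current
# data; validation normalizes with running_mean and running_var.
# The per-channel statistics are reshaped to [1, C, 1, 1] so that they
# broadcast over the channel dimension rather than the last dimension.
output2 = (x - cur_mean.view(1, -1, 1, 1)) / torch.sqrt(cur_var_biased.view(1, -1, 1, 1) + bn.eps)
print("output2:", output2)
```

Printed result. The computed output2 matches the output from step 2:

```
output2: tensor([[[[ 1.4150, -1.4102],
          [ 0.0257, -0.0305]],

         [[-0.9276, -0.1964],
          [ 1.6731, -0.5491]]]])
```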
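The BatchNorm section states that validation mode normalizes with running_mean and running_var, but its example only exercises training mode. The following is a minimal CPU-only sketch that checks both modes against hand-written formulas. The input values are copied from the printed tensor in the example; names such as `manual_train` and `manual_eval` are illustrative and do not come from the original post.

```python
import torch
import torch.nn as nn

# Input copied from the article's printed tensor, shape [N=1, C=2, H=2, W=2].
x = torch.tensor([[[[0.3367, 0.1288], [0.2345, 0.2303]],
                   [[-1.1229, -0.1863], [2.2082, -0.6380]]]])

bn = nn.BatchNorm2d(num_features=2, eps=1e-5, momentum=0.1)
with torch.no_grad():
    bn.running_mean.fill_(2.0)  # same starting buffers as in the article
    bn.running_var.fill_(1.0)

# Training mode: normalize with batch statistics and update the buffers.
bn.train()
train_out = bn(x)

dims = (0, 2, 3)  # reduce over N, H, W; keep one statistic per channel
batch_mean = x.mean(dim=dims)
biased_var = x.var(dim=dims, unbiased=False)   # used for normalization
unbiased_var = x.var(dim=dims, unbiased=True)  # used for the running_var update

expected_running_mean = (1 - bn.momentum) * 2.0 + bn.momentum * batch_mean
expected_running_var = (1 - bn.momentum) * 1.0 + bn.momentum * unbiased_var
manual_train = (x - batch_mean.view(1, -1, 1, 1)) / torch.sqrt(
    biased_var.view(1, -1, 1, 1) + bn.eps)

# Evaluation mode: normalize with the stored running statistics.
bn.eval()
eval_out = bn(x)
manual_eval = (x - bn.running_mean.view(1, -1, 1, 1)) / torch.sqrt(
    bn.running_var.view(1, -1, 1, 1) + bn.eps)

print(torch.allclose(bn.running_mean, expected_running_mean, atol=1e-6))
print(torch.allclose(bn.running_var, expected_running_var, atol=1e-6))
print(torch.allclose(train_out, manual_train, atol=1e-5))
print(torch.allclose(eval_out, manual_eval, atol=1e-5))
```

γ and β keep their initial values of 1 and 0 here, so the affine step is the identity and is left out of the manual formulas. With trained parameters, the manual result would additionally be multiplied by `bn.weight` and shifted by `bn.bias`, each reshaped to [1, C, 1, 1].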
2 SyncBatchNorm

The effectiveness of BN depends strongly on the batch size. Tasks such as object detection and semantic segmentation use a lot of GPU memory, so each card receives only a few images. In DP (DataParallel) mode, each card only sees the statistics of its own share of the batch. To use the same running_mean and running_var when validating or testing the model, DP mode updates running_mean and running_var only with the mean and variance computed on the main card, so BN naturally performs worse. One solution is to replace BN with SyncBN, which normalizes the input with global BN statistics. Compared with single-card statistics, global statistics are more accurate.

How SyncBatchNorm works

The two figures of this subsection in the original post come from https://cloud.tencent.com/developer/article/2126838

1) Compute the mean and variance on each card.
2) Synchronize the means and variances across cards: torch.distributed.all_gather collects the mean and variance from every GPU, the global mean and variance are computed from them, and running_mean and running_var are updated.
3) Normalize the input. This step is similar to BN.

SyncBN source code

```python
import torch
from torch.autograd.function import Function


class SyncBatchNorm(Function):

    @staticmethod
    def forward(self, input, weight, bias, running_mean, running_var, eps, momentum, process_group, world_size):
        input = input.contiguous()

        size = input.numel() // input.size(1)
        if size == 1:
            raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))
        count = torch.Tensor([size]).to(input.device)

        # calculate mean/invstd for input.
        mean, invstd = torch.batch_norm_stats(input, eps)

        count_all = torch.empty(world_size, 1, dtype=count.dtype, device=count.device)
        mean_all = torch.empty(world_size, mean.size(0), dtype=mean.dtype, device=mean.device)
        invstd_all = torch.empty(world_size, invstd.size(0), dtype=invstd.dtype, device=invstd.device)

        count_l = list(count_all.unbind(0))
        mean_l = list(mean_all.unbind(0))
        invstd_l = list(invstd_all.unbind(0))

        # using all_gather instead of all reduce so we can calculate count/mean/var in one go
        count_all_reduce = torch.distributed.all_gather(count_l, count, process_group, async_op=True)
        mean_all_reduce = torch.distributed.all_gather(mean_l, mean, process_group, async_op=True)
        invstd_all_reduce = torch.distributed.all_gather(invstd_l, invstd, process_group, async_op=True)

        # wait on the async communication to finish
        count_all_reduce.wait()
        mean_all_reduce.wait()
        invstd_all_reduce.wait()

        # calculate global mean & invstd
        mean, invstd = torch.batch_norm_gather_stats_with_counts(
            input,
            mean_all,
            invstd_all,
            running_mean,
            running_var,
            momentum,
            eps,
            count_all.view(-1).long().tolist()
        )

        self.save_for_backward(input, weight, mean, invstd, count_all)
        self.process_group = process_group

        # apply element-wise normalization
        out = torch.batch_norm_elemt(input, weight, bias, mean, invstd, eps)
        return out

    @staticmethod
    def backward(self, grad_output):
        grad_output = grad_output.contiguous()
        saved_input, weight, mean, invstd, count_tensor = self.saved_tensors
        grad_input = grad_weight = grad_bias = None
        process_group = self.process_group

        # calculate local stats as well as grad_weight / grad_bias
        sum_dy, sum_dy_xmu, grad_weight, grad_bias = torch.batch_norm_backward_reduce(
            grad_output,
            saved_input,
            mean,
            invstd,
            weight,
            self.needs_input_grad[0],
            self.needs_input_grad[1],
            self.needs_input_grad[2]
        )

        if self.needs_input_grad[0]:
            # synchronizing stats used to calculate input gradient.
            # TODO: move div_ into batch_norm_backward_elemt kernel
            sum_dy_all_reduce = torch.distributed.all_reduce(
                sum_dy, torch.distributed.ReduceOp.SUM, process_group, async_op=True)
            sum_dy_xmu_all_reduce = torch.distributed.all_reduce(
                sum_dy_xmu, torch.distributed.ReduceOp.SUM, process_group, async_op=True)

            # wait on the async communication to finish
            sum_dy_all_reduce.wait()
            sum_dy_xmu_all_reduce.wait()

            divisor = count_tensor.sum()
            mean_dy = sum_dy / divisor
            mean_dy_xmu = sum_dy_xmu / divisor
            # backward pass for gradient calculation
            grad_input = torch.batch_norm_backward_elemt(
                grad_output,
                saved_input,
                mean,
                invstd,
                weight,
                mean_dy,
                mean_dy_xmu
            )

        # synchronizing of grad_weight / grad_bias is not needed as distributed
        # training would handle all reduce.
        if weight is None or not self.needs_input_grad[1]:
            grad_weight = None

        if weight is None or not self.needs_input_grad[2]:
            grad_bias = None

        return grad_input, grad_weight, grad_bias, None, None, None, None, None, None
```

Using SyncBN

Note: SyncBN must be set up after the DDP environment has been initialized, but before the model is wrapped in DDP.

```python
import torch
from torch import distributed

# `model` is an existing nn.Module built earlier in the training script
distributed.init_process_group(backend="nccl")
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = torch.nn.parallel.DistributedDataParallel(model)
```

```python
@classmethod
def convert_sync_batchnorm(cls, module, process_group=None):
    module_output = module
    if isinstance(module, torch.nn.modules.batchnorm._BatchNorm):
        module_output = torch.nn.SyncBatchNorm(
            module.num_features,
            module.eps,
            module.momentum,
            module.affine,
            module.track_running_stats,
            process_group,
        )
        if module.affine:
            with torch.no_grad():
                module_output.weight = module.weight
                module_output.bias = module.bias
        module_output.running_mean = module.running_mean
        module_output.running_var = module.running_var
        module_output.num_batches_tracked = module.num_batches_tracked
        if hasattr(module, "qconfig"):
            module_output.qconfig = module.qconfig
    for name, child in module.named_children():
        module_output.add_module(name, cls.convert_sync_batchnorm(child, process_group))
    del module
    return module_output
```

3 InstanceNorm

How IN works

BN normalizes over the data of a whole batch. In image stylization tasks, however, the generated result depends mainly on one particular image instance, so normalizing over the whole batch is not appropriate. IN was proposed for this reason: it normalizes only over the H and W dimensions and keeps the N and C dimensions. The computation (shown as a figure in the original post) proceeds as follows:

1) Compute the mean and variance of the input tensor over the H and W dimensions.
2) Normalize the input tensor with that mean and variance.
3) Apply an affine transform to the normalized data using the learnable parameters γ and β.

Using IN

torch.nn.InstanceNorm2d(num_features, eps=1e-05, momentum=0.1, affine=False, track_running_stats=False)

```python
import torch.nn.functional as F
from torch import Tensor
from torch.nn.modules.batchnorm import _NormBase


class _InstanceNorm(_NormBase):
    def __init__(
        self,
        num_features: int,
        eps: float = 1e-5,
        momentum: float = 0.1,
        affine: bool = False,
        track_running_stats: bool = False,
        device=None,
        dtype=None
    ) -> None:
        factory_kwargs = {'device': device, 'dtype': dtype}
        super(_InstanceNorm, self).__init__(
            num_features, eps, momentum, affine, track_running_stats, **factory_kwargs)

    def _check_input_dim(self, input):
        raise NotImplementedError

    def _get_no_batch_dim(self):
        raise NotImplementedError

    def _handle_no_batch_input(self, input):
        return self._apply_instance_norm(input.unsqueeze(0)).squeeze(0)

    def _apply_instance_norm(self, input):
        return F.instance_norm(
            input, self.running_mean, self.running_var, self.weight, self.bias,
            self.training or not self.track_running_stats, self.momentum, self.eps)

    def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict,
                              missing_keys, unexpected_keys, error_msgs):
        version = local_metadata.get('version', None)
        # at version 1: removed running_mean and running_var when
        # track_running_stats=False (default)
        if version is None and not self.track_running_stats:
            running_stats_keys = []
            for name in ('running_mean', 'running_var'):
                key = prefix + name
                if key in state_dict:
                    running_stats_keys.append(key)
            if len(running_stats_keys) > 0:
                error_msgs.append(
                    'Unexpected running stats buffer(s) {names} for {klass} '
                    'with track_running_stats=False. If state_dict is a checkpoint '
                    'saved before 0.4.0, this may be expected '
                    'because {klass} does not track running stats by default '
                    'since 0.4.0. Please remove these keys from state_dict. If '
                    'the running stats are actually needed, instead set '
                    'track_running_stats=True in {klass} to enable them. See '
                    'the documentation of {klass} for details.'
                    .format(names=" and ".join('"{}"'.format(k) for k in running_stats_keys),
                            klass=self.__class__.__name__))
                for key in running_stats_keys:
                    state_dict.pop(key)

        super(_InstanceNorm, self)._load_from_state_dict(
            state_dict, prefix, local_metadata, strict,
            missing_keys, unexpected_keys, error_msgs)

    def forward(self, input: Tensor) -> Tensor:
        self._check_input_dim(input)

        if input.dim() == self._get_no_batch_dim():
            return self._handle_no_batch_input(input)

        return self._apply_instance_norm(input)


class InstanceNorm2d(_InstanceNorm):
    def _get_no_batch_dim(self):
        return 3

    def _check_input_dim(self, input):
        if input.dim() not in (3, 4):
            raise ValueError('expected 3D or 4D input (got {}D input)'.format(input.dim()))
```

Advantages of IN

IN suits tasks related to generative adversarial networks, such as style transfer. The generated image depends mainly on one particular image instance, so applying BN over the whole batch is not appropriate for style transfer. Using IN in this task speeds up convergence and keeps every image instance independent, unaffected by the other channels and by the batch size.

Disadvantages of IN

If the correlation between feature map channels needs to be exploited, IN is not recommended for normalization.

4 LayerNorm

How LN works

In NLP tasks such as text processing, different samples often have different lengths, so normalizing with BN is not very reasonable. LN was proposed for this reason: it normalizes over the C, H, and W dimensions. The computation (shown as a figure in the original post) proceeds as follows:

1) Compute the mean and variance of the input tensor over the C, H, and W dimensions.
2) Normalize the input with that mean and variance.
3) Apply an affine transform to the normalized data using the learnable parameters γ and β.

Using LN

torch.nn.LayerNorm(normalized_shape, eps=1e-05, elementwise_affine=True)

```python
import numbers
from typing import List, Tuple, Union

import torch
import torch.nn.functional as F
from torch import Size, Tensor
from torch.nn import Module, Parameter, init

_shape_t = Union[int, List[int], Size]


class LayerNorm(Module):
    __constants__ = ['normalized_shape', 'eps', 'elementwise_affine']
    normalized_shape: Tuple[int, ...]
    eps: float
    elementwise_affine: bool

    def __init__(self, normalized_shape: _shape_t, eps: float = 1e-5, elementwise_affine: bool = True,
                 device=None, dtype=None) -> None:
        factory_kwargs = {'device': device, 'dtype': dtype}
        super(LayerNorm, self).__init__()
        if isinstance(normalized_shape, numbers.Integral):
            # mypy error: incompatible types in assignment
            normalized_shape = (normalized_shape,)  # type: ignore[assignment]
        self.normalized_shape = tuple(normalized_shape)  # type: ignore[arg-type]
        self.eps = eps
        self.elementwise_affine = elementwise_affine
        if self.elementwise_affine:
            self.weight = Parameter(torch.empty(self.normalized_shape, **factory_kwargs))
            self.bias = Parameter(torch.empty(self.normalized_shape, **factory_kwargs))
        else:
            self.register_parameter('weight', None)
            self.register_parameter('bias', None)

        self.reset_parameters()

    def reset_parameters(self) -> None:
        if self.elementwise_affine:
            init.ones_(self.weight)
            init.zeros_(self.bias)

    def forward(self, input: Tensor) -> Tensor:
        return F.layer_norm(
            input, self.normalized_shape, self.weight, self.bias, self.eps)

    def extra_repr(self) -> str:
        return '{normalized_shape}, eps={eps}, ' \
            'elementwise_affine={elementwise_affine}'.format(**self.__dict__)
```

Advantages of LN

- LN does not require batched training. Normalization is done within a single sample, so LN can be used with batch size = 1 and for training RNNs, where it works better than BN.
- Different input samples have different means and variances, which helps reach the optimum faster and better.
- LN does not need to store batch means and variances, which saves extra storage.

Disadvantages of LN

LN is independent of the batch size. It may work better than BN with a small batch size, but with a large batch size BN still works better.

5 GroupNorm

How GN works

GN was designed to solve the problem that BN performs poorly with small batch sizes. It splits the channels into num_groups groups, each containing channel/num_groups channels, so the feature map becomes (N, G, C//G, H, W). It then computes the mean and variance over the (C//G, H, W) dimensions of each group, which makes it independent of the batch size. The extreme cases of GN are LN and IN, corresponding to G = 1 and G = C respectively. The computation (shown as a figure in the original post) proceeds as follows:

1) Compute the mean and variance of the input tensor over the C//G, H, and W dimensions.
2) Normalize the input with that mean and variance.
3) Apply an affine transform to the normalized data using the learnable parameters γ and β.

Using GN

torch.nn.GroupNorm(num_groups, num_channels, eps=1e-05, affine=True, device=None, dtype=None)

```python
import torch
import torch.nn.functional as F
from torch import Tensor
from torch.nn import Module, Parameter, init


class GroupNorm(Module):
    __constants__ = ['num_groups', 'num_channels', 'eps', 'affine']
    num_groups: int
    num_channels: int
    eps: float
    affine: bool

    def __init__(self, num_groups: int, num_channels: int, eps: float = 1e-5, affine: bool = True,
                 device=None, dtype=None) -> None:
        factory_kwargs = {'device': device, 'dtype': dtype}
        super(GroupNorm, self).__init__()
        if num_channels % num_groups != 0:
            raise ValueError('num_channels must be divisible by num_groups')

        self.num_groups = num_groups
        self.num_channels = num_channels
        self.eps = eps
        self.affine = affine
        if self.affine:
            self.weight = Parameter(torch.empty(num_channels, **factory_kwargs))
            self.bias = Parameter(torch.empty(num_channels, **factory_kwargs))
        else:
            self.register_parameter('weight', None)
            self.register_parameter('bias', None)

        self.reset_parameters()

    def reset_parameters(self) -> None:
        if self.affine:
            init.ones_(self.weight)
            init.zeros_(self.bias)

    def forward(self, input: Tensor) -> Tensor:
        return F.group_norm(
            input, self.num_groups, self.weight, self.bias, self.eps)

    def extra_repr(self) -> str:
        return '{num_groups}, {num_channels}, eps={eps}, ' \
            'affine={affine}'.format(**self.__dict__)
```

Advantages of GN

GN does not depend on the batch size and can also be applied to RNNs; this is its major advantage. The paper reports that results are best with G = 32 or with 16 channels per group, and that GN outperforms BN when the batch size is below 16.

Disadvantages of GN

With a large batch size, GN does not perform as well as BN.

6 Summary

- BN performs poorly with small batch sizes.
- IN operates on the pixels of each image and suits style transfer.
- LN is mainly effective for RNNs.
- GN splits the channels into groups and then normalizes; when the batch size is below 16 it outperforms BN.

References

[cnblogs] https://www.cnblogs.com/lxp-never/p/11566064.html
[Zhihu] https://zhuanlan.zhihu.com/p/395855181
[Tencent Cloud] https://cloud.tencent.com/developer/article/2126838
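Returning to the GroupNorm section's remark that LN and IN are the extreme cases of GN (G = 1 and G = C), here is a minimal sketch that checks this numerically with the functional API and without affine parameters. It also reproduces GN by hand through the (N, G, C//G, H, W) grouping described there. The tensor shape, the choice of 3 groups, and names such as `manual_gn` are illustrative assumptions, not taken from the original post.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(4, 6, 5, 5)  # [N, C, H, W], arbitrary illustrative shape
n, c, h, w = x.shape
eps = 1e-5

# G = 1: one group spanning all channels, same statistics as LN over (C, H, W).
gn_one_group = F.group_norm(x, num_groups=1, eps=eps)
ln = F.layer_norm(x, normalized_shape=(c, h, w), eps=eps)

# G = C: one group per channel, same statistics as IN over (H, W).
gn_all_groups = F.group_norm(x, num_groups=c, eps=eps)
inorm = F.instance_norm(x, eps=eps)

# Hand-written GN with an intermediate number of groups.
num_groups = 3
grouped = x.reshape(n, num_groups, -1)          # flatten (C//G, H, W) per group
mean = grouped.mean(dim=2, keepdim=True)
var = grouped.var(dim=2, unbiased=False, keepdim=True)
manual_gn = ((grouped - mean) / torch.sqrt(var + eps)).reshape(n, c, h, w)

print(torch.allclose(gn_one_group, ln, atol=1e-5))
print(torch.allclose(gn_all_groups, inorm, atol=1e-5))
print(torch.allclose(F.group_norm(x, num_groups, eps=eps), manual_gn, atol=1e-5))
```

The equivalence holds only for the normalization step. The module versions differ in their learnable parameters: torch.nn.GroupNorm keeps one γ and β per channel, while torch.nn.LayerNorm with normalized_shape=(C, H, W) keeps one per element.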

