#1 andy_marvelous replied:
Hello:
1. The network error does not simply grow layer by layer. Because of quantization, some layers enlarge the error while others shrink it. For example, suppose a conv layer on the MLU produces many outputs in (-1, 0) while the corresponding conv on the CPU produces many outputs in (0, 1); the MLU-vs-CPU error at that layer is then large. If that conv is followed by a ReLU, the ReLU turns the many (-1, 0) outputs on the MLU into 0 while the (0, 1) outputs on the CPU keep their original values, so the error shrinks.
2. Same as above.
3. By "error" here, do you mean the MSE between the MLU's last-layer output and the CPU's output, or something else?
4. That depends on the specific conv input; see also the reply to point 1.
5. Generally, when accuracy falls short of expectations, check whether quantization behaved as expected. The MLU uses min-max quantization: inspect the scale and pos of the abnormal layer, then manually apply the min-max quantization formula and check whether the weights in the pt file match what quantization should produce.
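As a rough illustration of the check point 5 suggests, here is a sketch using a generic symmetric min-max int8 scheme. The exact formula relating Cambricon's scale and pos parameters is not spelled out here and should be taken from the Cambricon documentation; this only shows the shape of the check (quantize the float weights yourself, dequantize, and compare against the weights stored in the quantized pt file):

```python
import torch

def minmax_int8_quantize(w: torch.Tensor) -> torch.Tensor:
    """Symmetric min-max int8 quantization of a weight tensor.

    Returns the dequantized weights so they can be compared against
    the values found in the quantized pt file.
    """
    absmax = w.abs().max()
    scale = absmax / 127.0                    # map [-absmax, absmax] -> [-127, 127]
    q = torch.clamp(torch.round(w / scale), -127, 127)
    return q * scale                          # dequantize back to float

# Example: quantization error on a conv weight of the same shape as layer (0)
w = torch.randn(32, 3, 6, 6)
w_dq = minmax_int8_quantize(w)
err = (w - w_dq).abs().max().item()
print(f"max abs quantization error: {err:.6f}")
```

If the dequantized weights differ from the pt file's weights by much more than half a quantization step, the layer's scale/pos are likely not what min-max quantization should have produced.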
This is the network I printed out:
(model): Sequential(
  (0): Conv(
    (conv): Conv2d(3, 32, kernel_size=(6, 6), stride=(2, 2), padding=(2, 2))
    (act): SiLU()
  )
  (1): Conv(
    (conv): Conv2d(32, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (act): SiLU()
  )
  (2): C3(
    (cv1): Conv(
      (conv): Conv2d(64, 32, kernel_size=(1, 1), stride=(1, 1))
      (act): SiLU()
    )
    (cv2): Conv(
      (conv): Conv2d(64, 32, kernel_size=(1, 1), stride=(1, 1))
      (act): SiLU()
    )
    (cv3): Conv(
      (conv): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1))
      (act): SiLU()
    )
    (m): Sequential(
      (0): Bottleneck(
        (cv1): Conv(
          (conv): Conv2d(32, 32, kernel_size=(1, 1), stride=(1, 1))
          (act): SiLU()
        )
        (cv2): Conv(
          (conv): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (act): SiLU()
        )
      )
    )
  )
)
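To find where the outputs diverge, the layer-by-layer comparison (the MSE that point 3 in the reply asks about) can be done with forward hooks that record each leaf module's output in two runs and report the per-layer MSE. A minimal CPU-only sketch; in the real comparison one of the two models would run on the MLU, and here a weight-perturbed copy merely stands in for the MLU run:

```python
import copy
import torch
import torch.nn as nn

def capture_outputs(model: nn.Module, x: torch.Tensor) -> dict:
    """Run model on x and return {module_name: output} for every leaf module."""
    outputs, handles = {}, []
    for name, mod in model.named_modules():
        if len(list(mod.children())) == 0:  # leaf modules only
            handles.append(mod.register_forward_hook(
                lambda m, inp, out, name=name: outputs.__setitem__(name, out.detach())))
    model(x)
    for h in handles:
        h.remove()
    return outputs

# Reference model, plus a perturbed copy standing in for the MLU run
ref = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
dev = copy.deepcopy(ref)
with torch.no_grad():
    dev[0].weight.add_(0.01 * torch.randn_like(dev[0].weight))

x = torch.randn(1, 3, 16, 16)
ref_out = capture_outputs(ref, x)
dev_out = capture_outputs(dev, x)
for name in ref_out:
    mse = torch.mean((ref_out[name] - dev_out[name]) ** 2).item()
    print(f"{name}: mse={mse:.6e}")
```

A layer whose MSE jumps sharply relative to the layer before it is a good candidate for the per-layer scale/pos inspection described above.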
From the dump results I guessed that the error appears in the Conv module, so I replaced the Conv forward with a CPU version:
class Conv(nn.Module):
    # Standard convolution
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):  # ch_in, ch_out, kernel, stride, padding, groups
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())

    # def forward(self, x):
    #     return self.act(self.bn(self.conv(x)))

    # def forward_fuse(self, x):
    #     return self.act(self.conv(x))

    def forward(self, x):
        x = x.to("cpu")
        cpu_cov = self.conv.to("cpu")
        cpu_act = self.act.to("cpu")
        cpu_bn = self.bn.to("cpu")
        x = cpu_act(cpu_bn(cpu_cov(x)))
        x = x.to(ct.mlu_device())
        return x

    def forward_fuse(self, x):
        x = x.to("cpu")
        cpu_cov = self.conv.to("cpu")
        cpu_act = self.act.to("cpu")
        return cpu_act(cpu_cov(x)).to(ct.mlu_device())
But it raises an error:
RuntimeError: torch_mlu::conv2d() Expected a value of type 'Tensor' for argument '_7' but instead found type 'NoneType'.
Position: 7
Value: None
Declaration: torch_mlu::conv2d(Tensor _0, Tensor _1, Tensor _2, int[] _3, int[] _4, int[] _5, int _6, Tensor _7, Tensor _8) -> (Tensor _0)
Could it be that the CPU conv2d is going wrong when it gets converted to torch_mlu::conv2d?
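One observation that may narrow this down: argument _7 in the torch_mlu::conv2d declaration sits where the bias tensor would be, and because the module is built with bias=False, the CPU nn.Conv2d stores its bias as None, which matches the NoneType in the message. Whether torch_mlu requires a real (e.g. zero) bias tensor at that position is an assumption to verify against the Cambricon docs; a quick CPU-only check of where the None comes from:

```python
import torch
import torch.nn as nn

# Conv layer built the same way as in the Conv module above: bias=False
conv = nn.Conv2d(3, 32, kernel_size=6, stride=2, padding=2, bias=False)
print(conv.bias)  # None -- this is what could reach torch_mlu::conv2d as _7

# On CPU, the functional conv accepts bias=None, so the CPU forward runs fine:
x = torch.randn(1, 3, 64, 64)
y = conv(x)
print(y.shape)  # torch.Size([1, 32, 32, 32])
```

So the CPU path tolerates the missing bias, while the MLU operator's signature declares the bias as a Tensor, which would explain why the same module fails only after the hand-off back to the MLU graph.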