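Preface: RepVGG: Making VGG-style ConvNets Great Again is a CVPR 2021 paper. As the title suggests, it uses structural re-parameterization to let VGG-like architectures regain top-tier accuracy while running faster. This article first walks through the paper in detail and then reproduces the RepVGG model in PyTorch.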
Source: DeepHub IMBA (reposted by CV技術指南 with permission)

Original: https://mp.weixin.qq.com/s/WpfBJCDhSn9sD3AX7n4ndw

Editor: CV技術指南

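Paper walkthrough

1. Problems with multi-branch models

Speed:

The figure above shows that the theoretical computational density of a 3×3 conv is about 4× that of the other operators, which suggests that total theoretical FLOPs is not a comparable proxy for actual speed across architectures. For example, VGG-16 has 8.4× the FLOPs of EfficientNet-B3 but runs 1.8× faster on a 1080Ti.

Multi-branch topologies, which use many small operators instead of a few large ones, are widely adopted in Inception and in auto-generated architectures.

The fragmentation number (the number of individual conv or pooling operations in one building block) of NASNet-A is 13, which is unfriendly to devices with strong parallel computing power such as GPUs.

Memory:

Multi-branch topologies are memory-inefficient because the result of every branch has to be kept until the addition or concatenation, which significantly raises the peak memory footprint. The figure above shows that the input to a residual block must be kept until the addition. Assuming the block preserves the feature map size, the peak extra memory is 2× the input.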
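
To make the 2× figure concrete, here is a back-of-the-envelope sketch. The tensor shape and the float32 element size are illustrative assumptions, not values from the paper, and the count covers only the feature maps alive at the peak (no weights, no workspace memory).

```python
def feature_map_bytes(n: int, c: int, h: int, w: int, bytes_per_elem: int = 4) -> int:
    """Size of one N x C x H x W activation tensor (float32 by default)."""
    return n * c * h * w * bytes_per_elem

# Hypothetical feature map: batch 32, 64 channels, 56 x 56 (assumed values).
one_map = feature_map_bytes(32, 64, 56, 56)

# Plain block: the input can be released as soon as the layer has finished,
# so one feature map is taken as the 1x baseline.
plain_peak = 1 * one_map

# Residual block: the input must stay alive until the addition, next to the
# result of the residual branch, so two feature maps are alive at the peak.
residual_peak = 2 * one_map

print(f"plain: {plain_peak / 2**20:.1f} MiB, residual: {residual_peak / 2**20:.1f} MiB")
```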

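2. RepVGG

(a) ResNet: it has a multi-path topology during both training and inference, which is slow and memory-inefficient.

(b) RepVGG training: the multi-path topology exists only during training.

As for multi-branch topologies, the success of ResNets has been explained by the fact that such an architecture makes the model an implicit ensemble of numerous shallower models. Specifically, with n blocks the model can be interpreted as an ensemble of 2^n models, since every block branches the flow into two paths. Multi-branch topologies have drawbacks at inference time, but the branches benefit training, so RepVGG uses multiple branches to build an ensemble of numerous models at training time only.

RepVGG uses ResNet-like identity branches (only when the dimensions match, so that the input is passed through unchanged) and 1×1 conv branches, so the training-time information flow of a building block is y = x + g(x) + f(x), as in (b) above. A model with n such blocks therefore becomes an ensemble of 3^n sub-models.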
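
The 2^n and 3^n counts can be checked by brute force: a sub-model corresponds to choosing one branch in every block. The sketch below enumerates those choices for an arbitrary n = 5; the function name is made up for this illustration.

```python
from itertools import product

def count_paths(n_blocks: int, branches_per_block: int) -> int:
    """Enumerate every way to pick one branch in each of n_blocks blocks."""
    return sum(1 for _ in product(range(branches_per_block), repeat=n_blocks))

n = 5  # arbitrary number of blocks, chosen only for the illustration
resnet_paths = count_paths(n, 2)   # ResNet-style block: shortcut or residual branch
repvgg_paths = count_paths(n, 3)   # RepVGG training block: identity, 1x1 or 3x3

print(resnet_paths, 2 ** n)
print(repvgg_paths, 3 ** n)
```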

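Re-parameterization for the plain inference-time model:

BN is used in every branch before the addition.

Let W(3), of size C2×C1×3×3, denote the kernel of a 3×3 conv layer with C1 input channels and C2 output channels, and let W(1), of size C2×C1, denote the kernel of the 1×1 branch.

μ(3), σ(3), γ(3), β(3) are the accumulated mean, standard deviation, learned scaling factor and bias of the BN layer following the 3×3 conv.

The BN parameters following the 1×1 conv are μ(1), σ(1), γ(1), β(1), and those of the identity branch are μ(0), σ(0), γ(0), β(0).

Let M(1), of size N×C1×H1×W1, and M(2), of size N×C2×H2×W2, be the input and output, and let * be the convolution operator.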

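If C1 = C2, H1 = H2 and W1 = W2, the output M(2) is the sum of three terms: bn applied to M(1) * W(3), bn applied to M(1) * W(1), and bn applied to M(1) itself, each with the BN parameters of its own branch. Otherwise there is no identity branch and only the first two terms remain.

Here bn is the inference-time BN function: for every channel it subtracts the mean μ, multiplies by γ/σ and adds β.

Merging BN into conv: first convert every BN and its preceding conv layer into a conv with a bias vector. Let {W', b'} be the converted kernel and bias: each output channel of W is scaled by γ/σ, and the bias becomes β − μγ/σ.

With these, the inference-time bn of a conv output equals the converted conv plus its bias, bn(M * W) = M * W' + b'.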
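
Written out with the symbols defined above, the four relations of this subsection read as follows. This is a reconstruction following the paper's Eqs. (1) to (4); the subscript i runs over the output channels, 1 ≤ i ≤ C2.

```latex
% (1) training-time block output when C_1 = C_2, H_1 = H_2, W_1 = W_2
M^{(2)} = \mathrm{bn}\big(M^{(1)} * W^{(3)}, \mu^{(3)}, \sigma^{(3)}, \gamma^{(3)}, \beta^{(3)}\big)
        + \mathrm{bn}\big(M^{(1)} * W^{(1)}, \mu^{(1)}, \sigma^{(1)}, \gamma^{(1)}, \beta^{(1)}\big)
        + \mathrm{bn}\big(M^{(1)}, \mu^{(0)}, \sigma^{(0)}, \gamma^{(0)}, \beta^{(0)}\big)

% (2) inference-time BN function
\mathrm{bn}(M, \mu, \sigma, \gamma, \beta)_{:,i,:,:}
    = \big(M_{:,i,:,:} - \mu_i\big)\,\frac{\gamma_i}{\sigma_i} + \beta_i

% (3) kernel and bias after folding BN into the preceding conv
W'_{i,:,:,:} = \frac{\gamma_i}{\sigma_i}\, W_{i,:,:,:},
\qquad
b'_i = -\frac{\mu_i \gamma_i}{\sigma_i} + \beta_i

% (4) BN of a conv output equals the converted conv plus bias
\mathrm{bn}(M * W, \mu, \sigma, \gamma, \beta)_{:,i,:,:} = (M * W')_{:,i,:,:} + b'_i
```

Relation (4) follows from (2) and (3) because convolution is linear: scaling output channel i of the kernel by γ_i/σ_i scales output channel i of the result by the same factor.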

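Merging all branches: this conversion also applies to the identity branch, because an identity can be viewed as a 1×1 conv with an identity matrix as its kernel. After these conversions we have one 3×3 kernel, two 1×1 kernels and three bias vectors. The final bias is the sum of the three bias vectors. The final 3×3 kernel is obtained by adding the two 1×1 kernels onto the central point of the 3×3 kernel, which can be implemented by zero-padding the two 1×1 kernels to 3×3 and adding the three kernels up, as shown in the figure above.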

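The RepVGG architecture is as follows

The 3×3 layers are arranged into 5 stages, and the first layer of each stage has stride 2. For image classification, global average pooling followed by a fully connected layer serves as the classification head. For other tasks, task-specific heads can be used on the features produced by any layer (for example the multi-scale features needed by segmentation and detection).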

The five stages have 1, 2, 4, 14 and 1 layers respectively; this instance is named RepVGG-A.

The deeper RepVGG-B has 2 more layers in each of stages 2, 3 and 4.

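Different values of the multipliers a and b produce different variants. a scales the first four stages and b the last stage, usually with b > a. To further reduce parameters and computation, groupwise 3×3 conv layers are interleaved with dense ones, trading some accuracy for efficiency. Specifically, the number of groups g is set for the 3rd, 5th, 7th, ..., 21st layers of RepVGG-A and for the additional 23rd, 25th and 27th layers of RepVGG-B. For simplicity, g is set globally to 1, 2 or 4 for these layers, without layer-wise tuning.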
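
The layer indices above can be derived mechanically from the stage depths. The following sketch does that, and also shows how a and b might turn into channel widths. The width rule [min(64, 64a), 64a, 128a, 256a, 512b] and the multipliers a=0.75, b=2.5 are assumptions made for this illustration and should be checked against the paper's architecture table; both function names are hypothetical.

```python
from typing import Dict, List

def repvgg_widths(a: float, b: float) -> List[int]:
    """Stage widths under the ASSUMED rule [min(64, 64a), 64a, 128a, 256a, 512b]."""
    return [min(64, int(64 * a)), int(64 * a), int(128 * a), int(256 * a), int(512 * b)]

def groupwise_layers(depths: List[int], g: int) -> Dict[int, int]:
    """Map 1-based layer index -> groups for every odd layer starting at the 3rd."""
    total_layers = sum(depths)
    return {layer: g for layer in range(3, total_layers, 2)}

depths_base = [1, 2, 4, 14, 1]     # 22 layers in total
depths_deeper = [1, 4, 6, 16, 1]   # 2 more layers in stages 2, 3 and 4: 28 layers

widths = repvgg_widths(a=0.75, b=2.5)          # illustrative multipliers
base_layers = sorted(groupwise_layers(depths_base, g=2))
deeper_layers = sorted(groupwise_layers(depths_deeper, g=4))

print(widths)
print(base_layers)         # odd layers 3 .. 21
print(deeper_layers[-3:])  # the additional layers of the deeper variant
```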

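3. Experimental results

RepVGG-A0 is 1.25% better in accuracy and 33% faster than ResNet-18, RepVGG-A1 is 0.29%/64% better than ResNet-34, and RepVGG-A2 is 0.17%/83% better than ResNet-50.

With interleaved groupwise layers (g2/g4), the RepVGG models are accelerated further with a reasonable drop in accuracy: RepVGG-B1g4 is 0.37%/101% better than ResNet-101, and RepVGG-B1g2 is 2.66× as fast as ResNet-152 at the same accuracy.

Although the number of parameters is not the primary concern, all of the RepVGG models above are more parameter-efficient than the ResNets.

Compared with the classic VGG-16, RepVGG-B2 has only 58% of the parameters, runs 10% faster and is 6.57% more accurate.

RepVGG models reach more than 80% accuracy with 200 epochs. RepVGG-A2 outperforms EfficientNet-B0 by 1.37%/59%, and RepVGG-B1 is 0.39% better than RegNetX-3.2GF while also running slightly faster.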

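4. Ablation studies

With both branches shown in the figure above removed, the training-time model degrades into an ordinary plain model and reaches only 72.39% accuracy.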

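Keeping only one of the two branches still falls short of the full block: 74.79% with the identity branch only and 73.15% with the 1×1 branch only.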

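The full-featured RepVGG-B0 reaches 75.14%, which is 2.75% higher than the ordinary plain model.

Segmentation:

The figure above shows results with a modified PSPNet framework. PSPNet with RepVGG backbones runs much faster than with ResNet-50/101 backbones, and the RepVGG backbones outperform both ResNet-50 and ResNet-101.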

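Now let's implement it in PyTorch.

Implementing RepVGG in PyTorch

1. Single- and multi-branch models

To implement RepVGG we first need to understand multi-branch models: the input passes through different layers, and the results are then aggregated in some way (usually by addition).

As the paper mentions, a multi-branch model can be seen as an implicit ensemble of numerous shallower models. More specifically, the model can be interpreted as an ensemble of 2^n models, since every block splits the flow into two paths.

Multi-branch models are slower than single-branch ones and consume more memory. Let's build a classic block to see why: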

import torch
from torch import nn, Tensor
from torchvision.ops import Conv2dNormActivation
from typing import Dict, List

torch.manual_seed(0)

class ResNetBlock(nn.Module):
  def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
      super().__init__()
      self.weight = nn.Sequential(
          Conv2dNormActivation(
              in_channels, out_channels, kernel_size=3, stride=stride
          ),
          Conv2dNormActivation(
              out_channels, out_channels, kernel_size=3, activation_layer=None
          ),
      )
      self.shortcut = (
          Conv2dNormActivation(
              in_channels,
              out_channels,
              kernel_size=1,
              stride=stride,
              activation_layer=None,
          )
          if in_channels != out_channels
          else nn.Identity()
      )

      self.act = nn.ReLU(inplace=True)

  def forward(self, x):
      res = self.shortcut(x) # <- 2x memory
      x = self.weight(x)
      x += res
      x = self.act(x) # <- 1x memory
      return x

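Storing the residual means about 2x the memory consumption, as in the memory figure from the paper shown earlier.

The multi-branch structure is only useful during training. So if we can remove it at prediction time, we improve both the speed and the memory consumption of the model. Let's see how to do that in code:

2. From multi-branch to single branch

Consider the following case: two branches, each made of one 3x3 conv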

class TwoBranches(nn.Module):
  def __init__(self, in_channels: int, out_channels: int):
      super().__init__()
      self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3)
      self.conv2 = nn.Conv2d(in_channels, out_channels, kernel_size=3)
       
  def forward(self, x):
      x1 = self.conv1(x)
      x2 = self.conv2(x)
      return x1 + x2

Let's look at the result

two_branches = TwoBranches(8, 8)

x = torch.randn((1, 8, 7, 7))

two_branches(x).shape

torch.Size([1, 8, 5, 5])

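Now let's create a single conv, call it "conv_fused", such that conv_fused(x) = conv1(x) + conv2(x). We can simply sum the weights and the biases of the two convs; this works because convolution is linear in its kernel and bias.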

conv1 = two_branches.conv1
conv2 = two_branches.conv2

conv_fused = nn.Conv2d(conv1.in_channels, conv1.out_channels, kernel_size=conv1.kernel_size)

conv_fused.weight = nn.Parameter(conv1.weight + conv2.weight)
conv_fused.bias = nn.Parameter(conv1.bias + conv2.bias)

# check they give the same output
assert torch.allclose(two_branches(x), conv_fused(x), atol=1e-5)

Let's time it!

from time import perf_counter

two_branches.to('cuda')
conv_fused.to('cuda')

with torch.no_grad():
  x = torch.randn((4, 8, 7, 7), device=torch.device('cuda'))

  # CUDA kernels launch asynchronously and nothing is synchronized here,
  # so a single un-warmed call like this gives only a rough indication
  start = perf_counter()
  two_branches(x)
  print(f'conv1(x) + conv2(x) took {perf_counter() - start:.6f}s')

  start = perf_counter()
  conv_fused(x)
  print(f'conv_fused(x) took {perf_counter() - start:.6f}s')

Roughly twice as fast:

conv1(x) + conv2(x) took 0.000421s
conv_fused(x) took 0.000215s
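
For steadier numbers, one option is to warm up first, average over many iterations, and synchronize around the timed region when running on CUDA. The sketch below is one way to do that, not the author's benchmark; `time_module`, `SumOfTwo` and the warm-up and iteration counts are assumptions made for this illustration. It falls back to the CPU when no GPU is available, and the actual timings depend on the hardware.

```python
from time import perf_counter

import torch
from torch import nn

def time_module(module: nn.Module, x: torch.Tensor, warmup: int = 10, iters: int = 100) -> float:
    """Average seconds per forward pass, synchronizing when the input is on CUDA."""
    with torch.no_grad():
        for _ in range(warmup):        # warm-up calls are not timed
            module(x)
        if x.is_cuda:
            torch.cuda.synchronize()   # finish queued kernels before starting the clock
        start = perf_counter()
        for _ in range(iters):
            module(x)
        if x.is_cuda:
            torch.cuda.synchronize()   # wait for the timed kernels to finish
        elapsed = perf_counter() - start
    return elapsed / iters

class SumOfTwo(nn.Module):
    """Two parallel 3x3 convs whose outputs are added."""
    def __init__(self, a: nn.Module, b: nn.Module):
        super().__init__()
        self.a = a
        self.b = b

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.a(x) + self.b(x)

device = "cuda" if torch.cuda.is_available() else "cpu"
two = SumOfTwo(nn.Conv2d(8, 8, kernel_size=3), nn.Conv2d(8, 8, kernel_size=3)).to(device)

# Fused conv: sum the two kernels and the two biases.
fused = nn.Conv2d(8, 8, kernel_size=3).to(device)
with torch.no_grad():
    fused.weight.copy_(two.a.weight + two.b.weight)
    fused.bias.copy_(two.a.bias + two.b.bias)

x = torch.randn(4, 8, 7, 7, device=device)
t_two = time_module(two, x)
t_fused = time_module(fused, x)
print(f"two branches: {t_two:.6f}s per call, fused: {t_fused:.6f}s per call")
```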

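3. Fusing Conv and BatchNorm

BatchNorm is used as the layer right after the convolution. The paper fuses the two together, i.e. conv_fused(x) = batchnorm(conv(x)).

The paper's two formulas explaining this are shown together in one screenshot for convenience:

The code looks like this: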

def get_fused_bn_to_conv_state_dict(
  conv: nn.Conv2d, bn: nn.BatchNorm2d
) -> Dict[str, Tensor]:
  # in the paper, weights is gamma and bias is beta
  bn_mean, bn_var, bn_gamma, bn_beta = (
      bn.running_mean,
      bn.running_var,
      bn.weight,
      bn.bias,
  )
  # we need the std!
  bn_std = (bn_var + bn.eps).sqrt()
  # eq (3)
  conv_weight = nn.Parameter((bn_gamma / bn_std).reshape(-1, 1, 1, 1) * conv.weight)
  # still eq (3)
  conv_bias = nn.Parameter(bn_beta - bn_mean * bn_gamma / bn_std)
  return {'weight': conv_weight, 'bias': conv_bias}

Let's see how it works:

conv_bn = nn.Sequential(
  nn.Conv2d(8, 8, kernel_size=3, bias=False),
  nn.BatchNorm2d(8)
)

torch.nn.init.uniform_(conv_bn[1].weight)
torch.nn.init.uniform_(conv_bn[1].bias)

with torch.no_grad():
  # be sure to switch to eval mode!!
  conv_bn = conv_bn.eval()
  conv_fused = nn.Conv2d(conv_bn[0].in_channels,
                          conv_bn[0].out_channels,
                          kernel_size=conv_bn[0].kernel_size)

  conv_fused.load_state_dict(get_fused_bn_to_conv_state_dict(conv_bn[0], conv_bn[1]))

  x = torch.randn((1, 8, 7, 7))
   
  assert torch.allclose(conv_bn(x), conv_fused(x), atol=1e-5)

This is how the paper fuses the Conv2d and BatchNorm2d layers.
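
As a cross-check, PyTorch ships a helper, `torch.nn.utils.fusion.fuse_conv_bn_eval`, that performs the same folding for modules in eval mode. The sketch below compares it with the formula used above on a randomly initialized layer pair; the channel counts and the random BN statistics are arbitrary choices for the illustration.

```python
import torch
from torch import nn
from torch.nn.utils.fusion import fuse_conv_bn_eval

torch.manual_seed(0)
conv = nn.Conv2d(8, 8, kernel_size=3, bias=False)
bn = nn.BatchNorm2d(8)

# Give BN non-trivial parameters and running statistics so the check is meaningful.
with torch.no_grad():
    bn.weight.uniform_(0.5, 1.5)
    bn.bias.uniform_(-0.5, 0.5)
    bn.running_mean.uniform_(-0.5, 0.5)
    bn.running_var.uniform_(0.5, 1.5)

# The helper asserts that both modules are in eval mode.
conv.eval()
bn.eval()
fused = fuse_conv_bn_eval(conv, bn)

with torch.no_grad():
    # Manual folding: W' = W * gamma / std, b' = beta - mean * gamma / std
    std = (bn.running_var + bn.eps).sqrt()
    manual_weight = conv.weight * (bn.weight / std).reshape(-1, 1, 1, 1)
    manual_bias = bn.bias - bn.running_mean * bn.weight / std

    x = torch.randn(1, 8, 7, 7)
    reference = bn(conv(x))
    out = fused(x)

print(torch.allclose(out, reference, atol=1e-5))
```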

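So the paper really has one goal: fuse the whole model into a single data flow (no branches) to make the network faster!

The authors propose a new RepVGG block. Like ResNet it has residual-style branches, including an identity branch, but these can be fused away at inference time, which makes it faster.

Following the figure above, the PyTorch code is: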

class RepVGGBlock(nn.Module):
  def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
      super().__init__()
      self.block = Conv2dNormActivation(
          in_channels,
          out_channels,
          kernel_size=3,
          padding=1,
          bias=False,
          stride=stride,
          activation_layer=None,
          # the original model may also have groups > 1
      )

      self.shortcut = Conv2dNormActivation(
          in_channels,
          out_channels,
          kernel_size=1,
          stride=stride,
          activation_layer=None,
      )

      self.identity = (
          # the identity branch only exists when input and output shapes match
          nn.BatchNorm2d(out_channels)
          if in_channels == out_channels and stride == 1
          else None
      )

      self.relu = nn.ReLU(inplace=True)

  def forward(self, x):
      res = x # <- 2x memory
      x = self.block(x)
      x += self.shortcut(res)
      if self.identity:
          x += self.identity(res)
      x = self.relu(x) # <- 1x memory
      return x

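4. Re-parameterization

We have a 3x3 conv->bn, a 1x1 conv->bn and (sometimes) a batchnorm (the identity branch). To fuse them we create a conv_fused such that conv_fused(x) = 3x3conv-bn(x) + 1x1conv-bn(x) + bn(x) or, if there is no identity branch, conv_fused(x) = 3x3conv-bn(x) + 1x1conv-bn(x).

To create this conv_fused we need to:

Fuse 3x3conv-bn(x) into a single 3x3 conv
Fuse 1x1conv-bn(x), then convert it into a 3x3 conv
Convert the identity BN into a 3x3 conv
Add the three 3x3 convs together

The figure below is the paper's summary:

The first step is easy: we can use get_fused_bn_to_conv_state_dict on RepVGGBlock.block (the main 3x3 conv-bn).

The second step is similar: use get_fused_bn_to_conv_state_dict on RepVGGBlock.shortcut (the 1x1 conv-bn). Then, as the paper describes, zero-pad the fused 1x1 kernel by one pixel on each spatial side to form a 3x3 kernel.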
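
The padding step can be verified in isolation. This sketch uses arbitrary channel counts and checks that a zero-padded 1x1 kernel, applied as a 3x3 conv with padding=1, gives the same output as the original 1x1 conv.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 4, 7, 7)
w_1x1 = torch.randn(6, 4, 1, 1)   # (out_channels, in_channels, 1, 1)

# F.pad pads the last dimensions first: [left, right, top, bottom].
# One zero on each side of both spatial dimensions turns 1x1 into 3x3.
w_3x3 = F.pad(w_1x1, [1, 1, 1, 1])

out_1x1 = F.conv2d(x, w_1x1)               # a 1x1 conv needs no padding
out_3x3 = F.conv2d(x, w_3x3, padding=1)    # a 3x3 conv with padding=1 keeps the size

print(w_3x3.shape)
print(torch.allclose(out_1x1, out_3x3, atol=1e-6))
```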

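The identity bn is trickier. The paper's trick is to create a 3x3 conv that acts as an identity function and then fuse it with the identity bn using get_fused_bn_to_conv_state_dict. Again, this is done by setting the weight at the center of the kernel to 1 for the matching channel.

A conv's weight has shape (out_channels, in_channels, kernel_h, kernel_w). To create an identity, conv(x) = x, we zero the weight and set the center of the kernel to 1 wherever the output channel equals the input channel. The code: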

with torch.no_grad():
  x = torch.randn((1,2,3,3))
  identity_conv = nn.Conv2d(2,2,kernel_size=3, padding=1, bias=False)
  identity_conv.weight.zero_()
  print(identity_conv.weight.shape)

  in_channels = identity_conv.in_channels
  for i in range(in_channels):
      identity_conv.weight[i, i % in_channels, 1, 1] = 1

  print(identity_conv.weight)
   
  out = identity_conv(x)
  assert torch.allclose(x, out)

Result

torch.Size([2, 2, 3, 3])
Parameter containing:
tensor([[[[0., 0., 0.],
          [0., 1., 0.],
          [0., 0., 0.]],

         [[0., 0., 0.],
          [0., 0., 0.],
          [0., 0., 0.]]],


        [[[0., 0., 0.],
          [0., 0., 0.],
          [0., 0., 0.]],

         [[0., 0., 0.],
          [0., 1., 0.],
          [0., 0., 0.]]]], requires_grad=True)
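
The same kernel can be built without a Python loop by indexing the channel diagonal directly. This is a small alternative sketch, not the author's code; `identity_kernel` is a name made up for the illustration.

```python
import torch
import torch.nn.functional as F

def identity_kernel(channels: int, kernel_size: int = 3) -> torch.Tensor:
    """Kernel of shape (channels, channels, k, k) for which conv(x) == x."""
    center = kernel_size // 2
    weight = torch.zeros(channels, channels, kernel_size, kernel_size)
    idx = torch.arange(channels)
    # Output channel i reads only the center pixel of input channel i.
    weight[idx, idx, center, center] = 1.0
    return weight

x = torch.randn(2, 5, 8, 8)
kernel = identity_kernel(5)
out = F.conv2d(x, kernel, padding=1)

print(torch.allclose(x, out, atol=1e-6))
```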

We have created a conv that behaves like an identity function. Putting everything together gives the re-parameterization from the paper.

@torch.no_grad()  # the in-place updates below modify tensors that require grad
def get_fused_conv_state_dict_from_block(block: RepVGGBlock) -> Dict[str, Tensor]:
  fused_block_conv_state_dict = get_fused_bn_to_conv_state_dict(
      block.block[0], block.block[1]
  )

  if block.shortcut is not None:
      # fuse the 1x1 shortcut
      conv_1x1_state_dict = get_fused_bn_to_conv_state_dict(
          block.shortcut[0], block.shortcut[1]
      )
      # we zero-pad the 1x1 kernel to a 3x3 kernel
      conv_1x1_state_dict['weight'] = torch.nn.functional.pad(
          conv_1x1_state_dict['weight'], [1, 1, 1, 1]
      )
      fused_block_conv_state_dict['weight'] += conv_1x1_state_dict['weight']
      fused_block_conv_state_dict['bias'] += conv_1x1_state_dict['bias']
  if block.identity is not None:
      # create our identity 3x3 conv kernel (no bias: the fusion ignores it)
      identity_conv = nn.Conv2d(
          block.block[0].in_channels,
          block.block[0].in_channels,
          kernel_size=3,
          bias=False,
          padding=1,
      ).to(block.block[0].weight.device)
      # set them to zero!
      identity_conv.weight.zero_()
      # set the middle element to one for the right channel
      in_channels = identity_conv.in_channels
      for i in range(identity_conv.in_channels):
          identity_conv.weight[i, i % in_channels, 1, 1] = 1
      # fuse the 3x3 identity
      identity_state_dict = get_fused_bn_to_conv_state_dict(
          identity_conv, block.identity
      )
      fused_block_conv_state_dict['weight'] += identity_state_dict['weight']
      fused_block_conv_state_dict['bias'] += identity_state_dict['bias']

  fused_conv_state_dict = {
      k: nn.Parameter(v) for k, v in fused_block_conv_state_dict.items()
  }

  return fused_conv_state_dict

Finally, define a RepVGGFastBlock. It consists only of conv + relu

class RepVGGFastBlock(nn.Sequential):
  def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
      super().__init__()
      self.conv = nn.Conv2d(
          in_channels, out_channels, kernel_size=3, stride=stride, padding=1
      )
      self.relu = nn.ReLU(inplace=True)

and add a to_fast method to RepVGGBlock to quickly create the RepVGGFastBlock

class RepVGGBlock(nn.Module):
  def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
      super().__init__()
      self.block = Conv2dNormActivation(
          in_channels,
          out_channels,
          kernel_size=3,
          padding=1,
          bias=False,
          stride=stride,
          activation_layer=None,
          # the original model may also have groups > 1
      )

      self.shortcut = Conv2dNormActivation(
          in_channels,
          out_channels,
          kernel_size=1,
          stride=stride,
          activation_layer=None,
      )

      self.identity = (
          # the identity branch only exists when input and output shapes match
          nn.BatchNorm2d(out_channels)
          if in_channels == out_channels and stride == 1
          else None
      )

      self.relu = nn.ReLU(inplace=True)

  def forward(self, x):
      res = x # <- 2x memory
      x = self.block(x)
      x += self.shortcut(res)
      if self.identity:
          x += self.identity(res)
      x = self.relu(x) # <- 1x memory
      return x

  def to_fast(self) -> RepVGGFastBlock:
      fused_conv_state_dict = get_fused_conv_state_dict_from_block(self)
      fast_block = RepVGGFastBlock(
          self.block[0].in_channels,
          self.block[0].out_channels,
          stride=self.block[0].stride,
      )

      fast_block.conv.load_state_dict(fused_conv_state_dict)

      return fast_block
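
A quick way to gain confidence in the whole procedure is to compare the three-branch output with the single fused conv in eval mode. The sketch below is standalone and works on raw kernels rather than the classes above; the channel count, the input shape and the random BN statistics are arbitrary, and `fuse_conv_bn` is a helper written only for this check.

```python
import torch
from torch import nn
import torch.nn.functional as F

def fuse_conv_bn(weight: torch.Tensor, bn: nn.BatchNorm2d):
    """Fold an eval-mode BatchNorm into the kernel of the preceding bias-free conv."""
    scale = bn.weight / (bn.running_var + bn.eps).sqrt()
    return weight * scale.reshape(-1, 1, 1, 1), bn.bias - bn.running_mean * scale

torch.manual_seed(0)
channels = 4
conv3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
conv1 = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
bn3, bn1, bn0 = (nn.BatchNorm2d(channels).eval() for _ in range(3))

with torch.no_grad():
    # Randomize BN parameters and statistics; the defaults make BN nearly a no-op.
    for bn in (bn3, bn1, bn0):
        bn.weight.uniform_(0.5, 1.5)
        bn.bias.uniform_(-0.5, 0.5)
        bn.running_mean.uniform_(-0.5, 0.5)
        bn.running_var.uniform_(0.5, 1.5)

    # Identity written as a 3x3 kernel: 1 at the center for matching channels.
    identity = torch.zeros(channels, channels, 3, 3)
    idx = torch.arange(channels)
    identity[idx, idx, 1, 1] = 1.0

    w3, b3 = fuse_conv_bn(conv3.weight, bn3)
    w1, b1 = fuse_conv_bn(conv1.weight, bn1)
    w0, b0 = fuse_conv_bn(identity, bn0)

    # Zero-pad the 1x1 kernel to 3x3, then sum the kernels and the biases.
    weight = w3 + F.pad(w1, [1, 1, 1, 1]) + w0
    bias = b3 + b1 + b0

    x = torch.randn(2, channels, 8, 8)
    multi_branch = F.relu(bn3(conv3(x)) + bn1(conv1(x)) + bn0(x))
    single_branch = F.relu(F.conv2d(x, weight, bias, padding=1))

print(torch.allclose(multi_branch, single_branch, atol=1e-5))
```

The comparison only holds in eval mode, because in training mode BatchNorm normalizes with batch statistics instead of the running statistics used for the fusion.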

5. RepVGG

Define RepVGGStage (a collection of blocks) and RepVGG, which gets a switch_to_fast method:

class RepVGGStage(nn.Sequential):
  def __init__(
      self,
      in_channels: int,
      out_channels: int,
      depth: int,
  ):
      super().__init__(
          RepVGGBlock(in_channels, out_channels, stride=2),
          *[RepVGGBlock(out_channels, out_channels) for _ in range(depth - 1)],
      )


class RepVGG(nn.Sequential):
  def __init__(self, widths: List[int], depths: List[int], in_channels: int = 3):
      super().__init__()
      in_out_channels = zip(widths, widths[1:])

      self.stages = nn.Sequential(
          RepVGGStage(in_channels, widths[0], depth=1),
          *[
              RepVGGStage(in_channels, out_channels, depth)
              for (in_channels, out_channels), depth in zip(in_out_channels, depths)
          ],
      )

      # omit classification head for simplicity

  def switch_to_fast(self):
      for stage in self.stages:
          for i, block in enumerate(stage):
              stage[i] = block.to_fast()
      return self

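That's it. Now let's test it.

6. Testing the model

A benchmark is provided in benchmark.py. It runs the models with different batch sizes on a GTX 1080 Ti, and these are the results:

The model has four stages with two layers per stage and widths of 64, 128, 256 and 512.

In the paper these values are scaled by certain ratios (called a and b), and grouped convolutions are used. Since we are more interested in the re-parameterization part, this is skipped here; it is a tuning matter that could be settled with a hyperparameter search.

Overall, the re-parameterized model runs on a different time scale from the default multi-branch model.

For batch_size=128, the default (multi-branch) model takes 1.45 s while the re-parameterized (fast) model takes only 0.0134 s, a 108× speedup.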

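Summary

This article first went through the RepVGG paper in detail, then built RepVGG step by step with a focus on how the weights are re-parameterized, and reproduced the paper's model in PyTorch. RepVGG's re-parameterization is a case of burning the bridge after crossing it: the model gets the training benefits of multiple branches for free and then runs even faster without them, which is almost unfair. The same "free lunch" trick can be ported to other architectures as well.

Paper:

http://arxiv.org/abs/2101.03697

Code:

https://github.com/FrancescoSaverioZuppichini/RepVgg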
