Ascend C Dynamic Shape Operator Development in Practice: From Principles to Code

2401_89199769 · Published 2025-10-25 21:56:08

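Dynamic shape operators are both central and difficult in Ascend C development. They let a single operator handle tensors of varying input sizes and are widely used where deep learning frameworks build networks flexibly. Using an addition operator as the running example, this article walks through the development principles, implementation steps, and optimization techniques for dynamic shape operators.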

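I. Core Principles of Dynamic Shape Operators

(1) Definition

A dynamic shape operator is one whose input tensor shape may vary within a declared range (for example, N from 1 to 1024 and H/W from 256 to 2048). The operator derives its input and output shapes at runtime through API calls, so no dimension has to be hard-coded.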
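To make "varying within a declared range" concrete, here is a minimal host-side sketch in plain C++ (not Ascend C) that checks a runtime shape against per-dimension ranges. `DimRange`, `ShapeInRange` and `ExampleRanges` are hypothetical names used only for illustration, and the range for C is an assumption, since the example above only gives ranges for N, H and W.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper type: inclusive range allowed for one dimension.
struct DimRange {
    int64_t min;
    int64_t max;
};

// Returns true when `shape` has one entry per range and every entry lies
// inside its range.
bool ShapeInRange(const std::vector<int64_t> &shape,
                  const std::vector<DimRange> &ranges) {
    if (shape.size() != ranges.size()) {
        return false;
    }
    for (std::size_t i = 0; i < shape.size(); ++i) {
        assert(ranges[i].min <= ranges[i].max);  // a range must not be empty
        if (shape[i] < ranges[i].min || shape[i] > ranges[i].max) {
            return false;
        }
    }
    return true;
}

// Ranges from the example in N, C, H, W order. The C range is an assumption:
// the article does not state one.
inline std::vector<DimRange> ExampleRanges() {
    return {{1, 1024}, {1, 1024}, {256, 2048}, {256, 2048}};
}
```

With these ranges, a shape of {8, 3, 512, 512} is accepted, while {8, 3, 128, 512} is rejected because H falls below 256.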

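(2) Key Advantages

Generality: one operator adapts to many input sizes, so there is no need to write a separate version per shape.
Framework compatibility: supports the dynamic network construction needs of frameworks such as PyTorch and TensorFlow (for example, a dynamic batch size).
Development efficiency: a single implementation covers many scenarios, which lowers maintenance cost.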

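(3) Key Techniques

Dynamic shape derivation: obtain the input shape through APIs such as GetShape and GetDim, then compute the tiling parameters and the output shape at runtime.
Dynamic resource allocation: initialize Local Memory buffers, thread configuration, and other resources according to the derived shape.
Boundary handling: adapt the edge tiles to each shape so that no data is dropped from the computation.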

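II. Implementation Steps for a Dynamic Shape Operator (Addition Operator Example)

(1) Step 1: Operator Prototype Definition (Dynamic Dimensions)

In the prototype file add_custom.json, declare the shape dimensions as symbolic parameters (N, C, H, W) rather than fixed values: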

{
    "opName": "AddCustomDynamic",
    "inputDesc": [
        {
            "name": "A",
            "dtype": "float32",
            "format": "NCHW",
            "shape": ["N", "C", "H", "W"]
        },
        {
            "name": "B",
            "dtype": "float32",
            "format": "NCHW",
            "shape": ["N", "C", "H", "W"]
        }
    ],
    "outputDesc": [
        {
            "name": "C",
            "dtype": "float32",
            "format": "NCHW",
            "shape": ["N", "C", "H", "W"]
        }
    ]
}

(2) Step 2: Host-Side Tiling Implementation (Dynamic Tiling)

Tiling struct definition: holds the tiling parameters computed at runtime

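// add_custom_tiling.h
#include <cstdint>

struct AddCustomDynamicTiling {
    int32_t tile_h;      // tile size along H, computed at runtime
    int32_t tile_w;      // tile size along W, computed at runtime
    int32_t tile_num_h;  // number of tiles along H
    int32_t tile_num_w;  // number of tiles along W
    int32_t total_tile;  // total number of tiles
};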

Tiling function implementation (derives the tiling parameters at runtime)

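// add_custom.cpp
Status AddCustomDynamicTilingFunc(const ge::Operator &op, AddCustomDynamicTiling &tiling) {
    // Read the runtime shape of input A (dimension order N, C, H, W).
    auto input_desc = op.GetInputDesc(0);
    auto shape = input_desc.GetShape();
    if (shape.GetDimNum() != 4) {
        return PARAM_INVALID;
    }
    int32_t h = static_cast<int32_t>(shape.GetDim(2));
    int32_t w = static_cast<int32_t>(shape.GetDim(3));
    // Reject empty or unresolved dimensions: a zero tile size would divide by zero below.
    if (h <= 0 || w <= 0) {
        return PARAM_INVALID;
    }

    // Cap each tile edge at 64 so that a single tile fits in Local Memory
    // (64×64 is the size assumed here).
    tiling.tile_h = (h > 64) ? 64 : h;
    tiling.tile_w = (w > 64) ? 64 : w;

    // Tile counts, rounded up so that partial edge tiles are included.
    tiling.tile_num_h = (h + tiling.tile_h - 1) / tiling.tile_h;
    tiling.tile_num_w = (w + tiling.tile_w - 1) / tiling.tile_w;
    // Number of tiles in one H×W plane.
    tiling.total_tile = tiling.tile_num_h * tiling.tile_num_w;

    return SUCCESS;
}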
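The tiling arithmetic can be checked on a host without any device code. The sketch below is a plain C++ mirror of the computation above; `TilePlan`, `CeilDiv` and `PlanTiles` are hypothetical names introduced for this illustration and are not part of the operator.

```cpp
#include <cassert>
#include <cstdint>

// Plain C++ mirror of the tiling parameters, for checking the arithmetic.
struct TilePlan {
    int32_t tile_h;
    int32_t tile_w;
    int32_t tile_num_h;
    int32_t tile_num_w;
    int32_t total_tile;
};

// Ceiling division for positive integers.
inline int32_t CeilDiv(int32_t a, int32_t b) {
    assert(a > 0 && b > 0);
    return (a + b - 1) / b;
}

// Caps each tile edge at `max_tile` and counts the tiles in one H x W plane.
TilePlan PlanTiles(int32_t h, int32_t w, int32_t max_tile = 64) {
    TilePlan p{};
    p.tile_h = (h > max_tile) ? max_tile : h;
    p.tile_w = (w > max_tile) ? max_tile : w;
    p.tile_num_h = CeilDiv(h, p.tile_h);
    p.tile_num_w = CeilDiv(w, p.tile_w);
    p.total_tile = p.tile_num_h * p.tile_num_w;
    return p;
}
```

For H = 300 and W = 1000, `PlanTiles` yields 64×64 tiles with 5 tiles along H and 16 along W, 80 in total. The last row of tiles covers only 300 - 256 = 44 rows and the last column only 1000 - 960 = 40 columns, which is why the kernel needs boundary handling.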

(3) Step 3: Kernel Class Implementation (Dynamic Resource Initialization)

// add_custom_kernel.cpp
// Schematic listing: ld_matrix, st_matrix and threadIdx.x stand in for the
// Global Memory <-> Local Memory copies and the per-lane index.
#include <new>  // std::nothrow

class KernelAddDynamic {
public:
    ~KernelAddDynamic() {
        // Release the tile buffers allocated in Init.
        delete[] local_a_;
        delete[] local_b_;
    }

    Status Init(__gm__ const AddCustomDynamicTiling &tiling) {
        // Size the Local Memory buffers from the tile dimensions chosen at runtime.
        tile_h_ = tiling.tile_h;
        tile_w_ = tiling.tile_w;
        local_a_ = new (std::nothrow) float[tile_h_ * tile_w_];
        local_b_ = new (std::nothrow) float[tile_h_ * tile_w_];
        if (local_a_ == nullptr || local_b_ == nullptr) {
            GE_LOGE("Local memory allocation failed!");
            return MEMALLOC_FAILED;
        }
        return SUCCESS;
    }

    // Processes one tile. input_h and input_w are the real H and W of the tensor.
    // They are passed in because the tiling struct stores only tile sizes and
    // counts, from which the true extent of an edge tile cannot be recovered.
    Status Compute(__gm__ const float *A, __gm__ const float *B, __gm__ float *C,
                   __gm__ const AddCustomDynamicTiling &tiling,
                   int32_t input_h, int32_t input_w, int32_t tile_idx) {
        // Offset of the current tile inside the H×W plane.
        int32_t h_offset = (tile_idx / tiling.tile_num_w) * tile_h_;
        int32_t w_offset = (tile_idx % tiling.tile_num_w) * tile_w_;

        // Clamp edge tiles to the real tensor extent to avoid out-of-bounds access.
        int32_t actual_h = (h_offset + tile_h_ > input_h) ? (input_h - h_offset) : tile_h_;
        int32_t actual_w = (w_offset + tile_w_ > input_w) ? (input_w - w_offset) : tile_w_;

        // Load the current tile of both inputs.
        ld_matrix(local_a_, A + h_offset * input_w + w_offset, actual_h, actual_w);
        ld_matrix(local_b_, B + h_offset * input_w + w_offset, actual_h, actual_w);

        // Parallel compute: each lane adds one element of the tile.
        int32_t local_idx = threadIdx.x;
        if (local_idx < actual_h * actual_w) {
            int32_t h = local_idx / actual_w;
            int32_t w = local_idx % actual_w;
            local_a_[h * actual_w + w] += local_b_[h * actual_w + w];
        }

        // Write the result tile back.
        st_matrix(C + h_offset * input_w + w_offset, local_a_, actual_h, actual_w);
        return SUCCESS;
    }

private:
    float *local_a_ = nullptr;
    float *local_b_ = nullptr;
    int32_t tile_h_ = 0;
    int32_t tile_w_ = 0;
};
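A host-side reference is a convenient way to validate the tile offsets and edge clamping before running anything on a device. The following is a minimal sketch in plain C++, assuming a single row-major H×W plane; `TiledAdd` is a hypothetical name, and the loops stand in for the load, compute and store steps of the kernel rather than reproducing them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Adds two row-major H x W planes tile by tile, clamping edge tiles to the
// real extent of the plane.
std::vector<float> TiledAdd(const std::vector<float> &a, const std::vector<float> &b,
                            int32_t h, int32_t w, int32_t tile_h, int32_t tile_w) {
    assert(h > 0 && w > 0 && tile_h > 0 && tile_w > 0);
    assert(a.size() == static_cast<std::size_t>(h) * static_cast<std::size_t>(w));
    assert(b.size() == a.size());

    std::vector<float> c(a.size(), 0.0f);
    const int32_t tile_num_h = (h + tile_h - 1) / tile_h;
    const int32_t tile_num_w = (w + tile_w - 1) / tile_w;

    for (int32_t tile_idx = 0; tile_idx < tile_num_h * tile_num_w; ++tile_idx) {
        const int32_t h_offset = (tile_idx / tile_num_w) * tile_h;
        const int32_t w_offset = (tile_idx % tile_num_w) * tile_w;
        // Edge tiles may be smaller than tile_h x tile_w.
        const int32_t actual_h = std::min(tile_h, h - h_offset);
        const int32_t actual_w = std::min(tile_w, w - w_offset);
        for (int32_t r = 0; r < actual_h; ++r) {
            for (int32_t col = 0; col < actual_w; ++col) {
                const std::size_t idx =
                    static_cast<std::size_t>(h_offset + r) * static_cast<std::size_t>(w) +
                    static_cast<std::size_t>(w_offset + col);
                c[idx] = a[idx] + b[idx];
            }
        }
    }
    return c;
}
```

Comparing the result with a plain element-wise loop on a shape that does not divide evenly by the tile size (for example 5×7 with 4×4 tiles) exercises all four kinds of tile: interior, right edge, bottom edge and corner.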

(4) Step 4: Shape Inference and Operator Registration

Shape inference (validates the inputs at runtime)

Status AddCustomDynamic::InferShape(const ge::Operator &op, std::vector<ge::TensorDesc> &output_desc) {
    auto a_desc = op.GetInputDesc(0);
    auto b_desc = op.GetInputDesc(1);

    // Both inputs must be 4-dimensional.
    if (a_desc.GetShape().GetDimNum() != 4 || b_desc.GetShape().GetDimNum() != 4) {
        GE_LOGE("Input shape dim num error! Expected 4, got %zu and %zu",
                a_desc.GetShape().GetDimNum(), b_desc.GetShape().GetDimNum());
        return PARAM_INVALID;
    }

    // The output shape of an element-wise add matches the input shape.
    output_desc.push_back(a_desc);
    return SUCCESS;
}
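Shape inference for an element-wise operator can also reconcile dimensions that are only known for one of the two inputs. The sketch below shows that idea in plain C++ on dimension vectors. It assumes that -1 marks a dimension unknown until runtime; `kUnknownDim` and `InferAddShape` are hypothetical names and do not belong to the operator above.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: -1 marks a dimension that is unknown until runtime.
constexpr int64_t kUnknownDim = -1;

// Writes the output shape of an element-wise add into `out` and returns true.
// Returns false when the inputs are not both 4-D or disagree on a known dimension.
bool InferAddShape(const std::vector<int64_t> &a, const std::vector<int64_t> &b,
                   std::vector<int64_t> &out) {
    if (a.size() != 4 || b.size() != 4) {
        return false;
    }
    std::vector<int64_t> result(4, kUnknownDim);
    for (std::size_t i = 0; i < 4; ++i) {
        if (a[i] == kUnknownDim) {
            result[i] = b[i];  // take whatever B knows (possibly still unknown)
        } else if (b[i] == kUnknownDim || a[i] == b[i]) {
            result[i] = a[i];
        } else {
            return false;      // both known and different
        }
    }
    out = result;
    return true;
}
```

For inputs {-1, 3, 512, 512} and {8, 3, 512, -1} the inferred output is {8, 3, 512, 512}, while {8, 3, 512, 512} against {8, 3, 256, 512} is rejected.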

Operator registration

REG_OP(AddCustomDynamic)
    .INPUT(A, TensorType({DT_FLOAT}))
    .INPUT(B, TensorType({DT_FLOAT}))
    .OUTPUT(C, TensorType({DT_FLOAT}))
    .OPERATOR_CLASS(AddCustomDynamic)
    .TILING_FUNC(AddCustomDynamicTilingFunc);

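III. Optimization Techniques for Dynamic Shape Operators

Adaptive tile size: choose the tile size from the input shape (for example, 32×32 tiles for small shapes and 64×64 tiles for large ones) to balance compute efficiency against memory footprint.
Buffer reuse: reuse the Local Memory buffers inside the Kernel class instead of allocating and freeing them repeatedly.
Thread configuration tuning: adjust the thread-block configuration to the tile size (for example, dim3 block(tile_h_ * tile_w_)) to make full use of the chip's compute capacity.
Separate handling of edge tiles: for edge tiles whose extent is not a multiple of the tile size, compute the actual data size separately to avoid out-of-bounds access.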
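The adaptive tile size rule can be sketched as a small host-side helper. This is an illustration under stated assumptions, not the operator's actual policy: `TileBufferBytes` and `ChooseTileEdge` are hypothetical names, the threshold of 256 that separates "small" from "large" shapes is an assumed value, and the memory budget is a parameter rather than a real hardware figure.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bytes needed for the two float tile buffers (A and B) of an edge x edge tile.
inline std::size_t TileBufferBytes(int32_t edge) {
    assert(edge > 0);
    return 2u * static_cast<std::size_t>(edge) * static_cast<std::size_t>(edge) * sizeof(float);
}

// Returns 64 when both H and W reach `large_threshold` and the two 64x64
// buffers fit into `budget_bytes`; otherwise falls back to 32.
// `large_threshold` is an assumed cut-off between "small" and "large" shapes.
int32_t ChooseTileEdge(int32_t h, int32_t w, std::size_t budget_bytes,
                       int32_t large_threshold = 256) {
    assert(h > 0 && w > 0);
    const bool large = (h >= large_threshold) && (w >= large_threshold);
    if (large && TileBufferBytes(64) <= budget_bytes) {
        return 64;
    }
    return 32;
}
```

With 4-byte floats, two 64×64 buffers need 32 KiB and two 32×32 buffers need 8 KiB, so a 512×512 input gets 64×64 tiles under a 64 KiB budget but falls back to 32×32 under a 16 KiB budget.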