# PyTorch and CEDR: Enabling Deployment of Machine Learning Models on Heterogeneous Computing Systems

**Umut Suluhan<sup>1</sup>**, Serhan Gener<sup>1</sup>, Alexander Fusco<sup>1</sup>, Fatih Ugurdag<sup>2</sup>, and Ali Akoglu<sup>1</sup>

<sup>1</sup>University of Arizona, <sup>2</sup>Ozyegin University





#### Goal and Motivation

PyTorch offers productive GPU-based







A steep increase in computation & memory demand





No path forward for deploying ML models on systems that offer a balance between throughput and energy efficiency

Aim: Productive PyTorch model deployment on heterogeneous SoCs

- hardware agnostic application development
- balance trade-off between throughput and energy efficiency
- explore SoC configurations for PyTorch based workflows

[1] Systems, C. (2022, August 30). Cerebras Architecture Deep Dive: First look inside the HW/SW co-design for Deep Learning. Medium.

#### **Technical Contributions**



Hardware-agnostic model development and deployment experience for PyTorch developers across a rich set of off-the-shelf SoC platforms



# CEDR¹ - A Compiler-Integrated, Extensible DSSoC Runtime

Open source<sup>2</sup>, unified environment for programming and execution on heterogeneous SoCs

#### **Key features:**

- Provides users with an abstraction layer through APIs
- Refactors applications into a sequence of hardware-agnostic function calls
- Generates an application representation that allows the runtime system to invoke each function call on its supported processing elements
- Flexible enough to execute arbitrary, interleaved workloads on various accelerators
- Portable across off-the-shelf SoC platforms
- Avoids requiring users to become hardware experts
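
As a rough illustration of the abstraction layer described above, the sketch below shows one way a hardware-agnostic call could be resolved to an implementation on a supported processing element. The registry, the `zip_add` name, and the `cpu` / `zip_accel` labels are hypothetical and are not CEDR's real API; the accelerator path is a stand-in that reuses the CPU math.

```python
from typing import Callable, Dict, List

# Hypothetical registry: API name -> {processing element -> implementation}.
# These names are illustrative only and do not come from CEDR.
Impl = Callable[[List[float], List[float]], List[float]]


def add_cpu(a: List[float], b: List[float]) -> List[float]:
    """CPU implementation of an element-wise add."""
    return [x + y for x, y in zip(a, b)]


def add_accel(a: List[float], b: List[float]) -> List[float]:
    """Stand-in for an accelerator offload; it simply reuses the CPU math."""
    return add_cpu(a, b)


REGISTRY: Dict[str, Dict[str, Impl]] = {
    "zip_add": {"cpu": add_cpu, "zip_accel": add_accel},
}


def supported_pes(api_name: str) -> List[str]:
    """Processing elements that can execute the given API call."""
    return sorted(REGISTRY[api_name])


def invoke(api_name: str, pe: str, a: List[float], b: List[float]) -> List[float]:
    """Run the API call on the processing element chosen by the runtime."""
    impls = REGISTRY[api_name]
    if pe not in impls:
        raise ValueError(f"{api_name} is not supported on {pe}")
    return impls[pe](a, b)
```

The application only names the operation (`zip_add`); which processing element runs it is decided elsewhere, which is the property the key features above rely on.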



#### Need for a transformation tool that can translate PyTorch models into a CEDR compatible application representation



[1] J. Mack et al. "CEDR-API: Productive, Performant Programming of Domain-Specific Embedded Systems," IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2023, https://doi.org/10.1109/IPDPSW59300.2023.00016

[2] Project Homepage: https://ua-rcl.github.io/projects/cedr/








```
import torch.nn as nn

def conv2d(in_channel, out_channel, kernel_size, stride=1, padding=0):
    # Convolution followed by ReLU, packaged as one sequential block
    return nn.Sequential(
        nn.Conv2d(in_channel, out_channel, kernel_size, stride, padding),
        nn.ReLU())

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv_relu = conv2d(3, 8, 3)
        self.linear = nn.Linear(32, 16)
        self.maxpool = nn.MaxPool2d(2, 2)

    def forward(self, x):
        conv1 = self.conv_relu(x)
        pool1 = self.maxpool(conv1)
        linear1 = self.linear(pool1)
        return linear1
```

```
{"sequential": "yes",
 "name": "conv_relu",
 "length": 2,
 "0": {"name": "conv_relu_0",
       "type": "Conv2d",
       "in_channels": 3,
       "out_channels": 8,
       "kernel_size": [3,3]},
 "1": {"name": "conv_relu_1",
       "type": "ReLU"}}
{"sequential": "no",
 "name": "linear",
 "type": "Linear",
 "in_features": 32,
 "out_features": 16}
{"sequential": "no",
 "name": "maxpool",
 "type": "MaxPool2d"}
```

- Extracts information regarding each layer
- Distinguishes layers from each other with distinct attributes
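
A minimal sketch of this extraction step is given below. To keep it runnable without PyTorch, it uses small stand-in layer classes (`Conv2d`, `ReLU`, `Linear`, `Sequential`) that only carry the attributes shown in the records above; the `describe` helper is hypothetical and is not the tool's actual implementation.

```python
from dataclasses import dataclass
from typing import List


# Stand-in layer classes so the sketch runs without PyTorch installed.
@dataclass
class Conv2d:
    in_channels: int
    out_channels: int
    kernel_size: List[int]


@dataclass
class ReLU:
    pass


@dataclass
class Linear:
    in_features: int
    out_features: int


@dataclass
class Sequential:
    layers: list


def describe(name, layer):
    """Build one record per top-level layer, expanding sequential blocks."""
    if isinstance(layer, Sequential):
        record = {"sequential": "yes", "name": name, "length": len(layer.layers)}
        for i, sub in enumerate(layer.layers):
            # Sub-layers get an indexed name and their own attributes.
            record[str(i)] = {"name": f"{name}_{i}",
                              "type": type(sub).__name__, **vars(sub)}
        return record
    return {"sequential": "no", "name": name,
            "type": type(layer).__name__, **vars(layer)}
```

Calling `describe("conv_relu", Sequential([Conv2d(3, 8, [3, 3]), ReLU()]))` yields a record with the same fields as the first entry above. With real PyTorch modules, the attributes would be read from the module objects instead of from dataclass fields.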





- Obtains a DAG based on the input-output relationship between layers
- Allows building a C++ based model while respecting the dataflow and layer attributes

```
{"name": "conv_relu",
 "input": "x",
 "output": "conv1",
 "next": "maxpool",
 "sequential": "yes",
 "length": 2,
 "0": {"name": "conv_relu_0",
       "type": "Conv2d",
       "in_channels": 3,
       "out_channels": 8,
       "kernel_size": [3,3]},
 "1": {"name": "conv_relu_1",
       "type": "ReLU"}}
{"name": "maxpool",
 "input": "conv1",
 "output": "pool1",
 "next": "linear",
 "sequential": "no",
 "type": "Maxpool2d"}
{"name": "linear",
 "input": "pool1",
 "output": "linear1",
 "next": "NONE",
 "sequential": "no",
 "type": "Linear"}
```
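
The `input`, `output`, and `next` fields can be derived from the model's `forward` method. The sketch below is one possible approach, not necessarily the one the tool uses: it parses the source with Python's standard `ast` module and handles only the simple `out = self.layer(inp)` pattern, with each output consumed by at most one layer. `build_dag` and `FORWARD_SRC` are hypothetical names.

```python
import ast

# The forward method from the example model, as source text.
FORWARD_SRC = """
def forward(self, x):
    conv1 = self.conv_relu(x)
    pool1 = self.maxpool(conv1)
    linear1 = self.linear(pool1)
    return linear1
"""


def build_dag(forward_src):
    """Return one node per layer call, linked by input/output variable names."""
    func = ast.parse(forward_src).body[0]
    nodes = []
    for stmt in func.body:
        # Only the pattern `out = self.layer(inp)` is recognized here.
        if not (isinstance(stmt, ast.Assign) and isinstance(stmt.value, ast.Call)):
            continue
        call = stmt.value
        nodes.append({
            "name": call.func.attr,        # layer attribute, e.g. "maxpool"
            "input": call.args[0].id,      # variable passed to the layer
            "output": stmt.targets[0].id,  # variable holding the result
            "next": "NONE",
        })
    # A node's successor is the layer that consumes its output variable.
    consumer_of = {node["input"]: node["name"] for node in nodes}
    for node in nodes:
        node["next"] = consumer_of.get(node["output"], "NONE")
    return nodes
```

For the example model this produces the chain `conv_relu` to `maxpool` to `linear`, with `linear` marked as the final node.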

```
Conv2d* conv_relu_0 = new Conv2d(3, 8, 3);
ReLU* relu = new ReLU();
Linear* linear = new Linear(32, 16);
Maxpool2d* maxpool = new Maxpool2d();

Module* module = new Module();
module->add(*conv_relu_0);
module->add(*linear);
module->filter_assign();

x = conv_relu_0->forward(x);
Tensor3D* conv1 = relu->forward(x);
Tensor3D* pool1 = maxpool->forward(conv1);
Tensor2D* linear1 = linear->forward(pool1);
```

- Maps each layer in the DAG to an equivalent C++ implementation
- Serves as a baseline model for replacing key kernels with hardware-agnostic, CEDR-compatible API calls
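
The layer-to-C++ mapping can be pictured as a small code generator. The sketch below is illustrative only: `emit_cpp` and the `layers` table are hypothetical, it handles flat (non-sequential) layers, and it emits `auto*` for results because the tensor type of each output is not modeled here.

```python
def emit_cpp(dag, layers):
    """Emit C++ declarations and forward calls for a chain of DAG nodes.

    dag:    list of {"name", "input", "output"} records in execution order
    layers: {layer name: {"type": C++ class name, "args": constructor args}}
    """
    lines = []
    # One object per layer, constructed from the extracted attributes.
    for node in dag:
        spec = layers[node["name"]]
        args = ", ".join(str(a) for a in spec["args"])
        lines.append(f'{spec["type"]}* {node["name"]} = new {spec["type"]}({args});')
    # Forward calls follow the DAG order and reuse its variable names.
    for node in dag:
        lines.append(
            f'auto* {node["output"]} = {node["name"]}->forward({node["input"]});')
    return lines


# Example records, taken from the maxpool and linear nodes of the DAG.
DAG = [
    {"name": "maxpool", "input": "conv1", "output": "pool1"},
    {"name": "linear", "input": "pool1", "output": "linear1"},
]
LAYERS = {
    "maxpool": {"type": "Maxpool2d", "args": []},
    "linear": {"type": "Linear", "args": [32, 16]},
}
```

`emit_cpp(DAG, LAYERS)` returns four lines of C++ text: two constructor calls followed by two `forward` calls in dataflow order.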





- Key kernels are replaced with CEDR compatible hardware agnostic API calls through function inlining
- Generates final C++ model with API implementations that can be compiled and executed with CEDR.

```
void conv(float *in, float *filter, float *bias, float *out, int height, int width,
          int kernel_size, int filter_number, int in_channel){
    for(int i = 0; i < filter_number; i++){
        // Memory allocation operations
        for(int j = 0; j < in_channel; j++){
            float *conv_output = (float *) malloc(sizeof(float) * height * width);
            // 2D convolution of input channel j with the kernel of filter i
            CEDR_CONV_2D(&(in[j * height * width]), height, width,
                &(filter[((i * in_channel + j) * kernel_size * kernel_size)]),
                kernel_size, conv_output);
            // Accumulate the partial result into output channel i
            CEDR_ZIP(&conv_output, &(out[(i * height * width)]),
                &(out[(i * height * width)]), height * width / 2, ZIP_ADD);}}}

void linear(float *in, float *filter, float *bias, float *out, int channel,
            int in_channel, int out_channel){
    float *filter_t = (float *) malloc(sizeof(float) * in_channel * out_channel);
    transpose_linear_weight(filter, filter_t, out_channel, in_channel);
    // Tensor manipulations
    CEDR_GEMM(in, filter_t, out, channel, in_channel, out_channel);
    // Tensor manipulations
}
```
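
The loop nest in `conv` above computes, for every filter, the sum over input channels of a per-channel 2D convolution. The pure-Python reference below sketches that arithmetic so the roles of the two API calls are easier to follow. It assumes zero padding so that each partial result has the input's height and width (suggested by the `height * width` allocation, but not confirmed by the source), and it omits the bias, as the excerpt does. `conv2d_same` and `conv_layer` are hypothetical helper names.

```python
def conv2d_same(channel, kernel):
    """Zero-padded 2D cross-correlation; the output has the input's shape."""
    h, w, k = len(channel), len(channel[0]), len(kernel)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for dr in range(k):
                for dc in range(k):
                    rr, cc = r + dr - pad, c + dc - pad
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += channel[rr][cc] * kernel[dr][dc]
            out[r][c] = acc
    return out


def conv_layer(inp, filters):
    """inp: [in_channel][h][w]; filters: [filter_number][in_channel][k][k]."""
    h, w = len(inp[0]), len(inp[0][0])
    out = [[[0.0] * w for _ in range(h)] for _ in filters]
    for i, kernels in enumerate(filters):
        for j, kernel in enumerate(kernels):
            # Per-channel convolution: the role played by CEDR_CONV_2D.
            partial = conv2d_same(inp[j], kernel)
            # Element-wise accumulation: the role played by CEDR_ZIP with ZIP_ADD.
            for r in range(h):
                for c in range(w):
                    out[i][r][c] += partial[r][c]
    return out
```

A reference like this can serve as a correctness check for the generated model on small inputs before the kernels are dispatched to accelerators.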



- C++ model with API calls is compiled and prepared for execution on the heterogeneous SoC
- Integrated scheduler makes task-to-processing-element mapping decisions based on the current state of system resources and performance goals.



### Experimental Setup

**Hardware Composition**

Xilinx Zynq UltraScale+ MPSoC ZCU102

- 3 CPUs
- Accelerators (Conv2D, FFT, ZIP)

**Workload Composition**

- Object Detection
- Visual Geometry Group
- Speech Classification
- Wifi-TX
- Pulse Doppler (10-1)

**Schedulers**

- Earliest Finish Time (EFT)
- Earliest Time to Finish (ETF)
- Heterogeneous Earliest Finish Time (HEFT-RT¹)
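
To make the scheduling heuristics more concrete, the sketch below shows a simplified greedy Earliest Finish Time policy: each ready task goes to the processing element on which it would finish soonest, given when that element becomes free. It is a toy model rather than CEDR's scheduler code; the function name, task names, and execution times are hypothetical.

```python
def eft_schedule(tasks, pe_free_at, now=0.0):
    """Assign each ready task to the PE with the earliest finish time.

    tasks:      list of (task name, {PE name: execution time on that PE});
                a PE missing from the dict does not support the task
    pe_free_at: {PE name: time at which the PE becomes idle}, updated in place
    """
    schedule = []
    for name, exec_time in tasks:
        best_pe, best_finish = None, float("inf")
        for pe, cost in exec_time.items():
            # A task starts once the PE is idle, but never before `now`.
            finish = max(now, pe_free_at[pe]) + cost
            if finish < best_finish:
                best_pe, best_finish = pe, finish
        pe_free_at[best_pe] = best_finish
        schedule.append((name, best_pe, best_finish))
    return schedule
```

With two convolution tasks that take 10 time units on a CPU and 2 on a convolution accelerator, both are queued on the accelerator (finishing at 2 and 4), while a CPU-only task runs on the CPU in parallel. This kind of overlap is what a heterogeneity-aware runtime exploits.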



[1] J. Mack et al. "Performant, multi-objective scheduling of highly interleaved task graphs on heterogeneous system on chip devices," IEEE Transactions on Parallel and Distributed Systems, vol. 33, no. 9, pp. 2148-2162, 2022, https://doi.org/10.1109/TPDS.2021.3135876

### Cross-Domain Applications



- Saturation trend with respect to workload complexity (injection rate): oversubscribed system
- The resource-rich configuration saturates last
- Execution time decreases as the degree of heterogeneity increases



# PyTorch Applications

CEDR can execute multiple models concurrently on the target SoC in dynamically arriving workload scenarios







First 10 seconds of the Gantt chart for multiple PyTorch applications running on 3 CPUs, 1 C2D, and 1 ZIP accelerator with EFT scheduler

#### Demo





#### Related Work

Other works in the literature offer system-level solutions and allow running neural network workloads on heterogeneous SoCs:

- Zhong et al.<sup>1</sup> leverage FPGA accelerators and NEON engine cores to offload convolution workloads, targeting experienced engineers
- Shea et al.<sup>2</sup> design a hardware accelerator specifically for neural network workloads and introduce a heterogeneity-aware scheduler
- Dagli et al.<sup>3</sup> implement layer-level design-time scheduling, aiming to balance energy and performance.

The proposed transformation tool is unique when coupled with CEDR, as other methods fall short in terms of programmer productivity and cannot cope with dynamically arriving workload scenarios



[1] G. Zhong et al. "Synergy: An hw/sw framework for high throughput cnns on embedded heterogeneous soc," ACM Transactions on Embedded Computing Systems (TECS), vol. 18, no. 2, pp. 1–23, 2019.

[2] C. Shea et al. "Heterogeneous scheduling of deep neural networks for low-power real-time designs," ACM Journal on Emerging Technologies in Computing Systems (JETC), vol. 15, no. 4, pp. 1–31, 2019.

[3] I. Dagli et al. "AxoNN: energy-aware execution of neural network inference on multi-accelerator heterogeneous SoCs," Proceedings of the 59th ACM/IEEE Design Automation Conference, 2022, pp. 1069–1074.

#### Conclusions and Future Work

- For the first time, PyTorch application developers have access to FPGA-based execution without having to become hardware experts
  - balance trade-off between throughput and energy efficiency
  - explore SoC configurations for dynamic workloads

#### Next Step

Generalize the framework to support a wide range of ML models.





# Thank you

- Questions?
- Links
  - Website: <a href="https://ua-rcl.github.io/projects/cedr/">https://ua-rcl.github.io/projects/cedr/</a>
  - Source code: <a href="https://github.com/UA-RCL/CEDR/">https://github.com/UA-RCL/CEDR/</a>
- Contact
  - Umut Suluhan: <u>suluhan@arizona.edu</u>













# **BACKUP**



#### Results - BACKUP

- Sophisticated schedulers make better scheduling decisions as the resource pool grows, since they consider the execution time on all processing elements.





#### Results - BACKUP

- A rich resource pool enables the system to overlap the execution of applications, leading to faster execution per instance as the number of simultaneously running instances increases.



Scheduler comparison for increasing number of instances

