Heterogeneous optimization
Fully utilizes the computing power of each processor to improve overall computing performance.
Deep model customization
Combines the characteristics of the model's algorithms with those of the hardware to transform the model's data layout and algorithms accordingly.
How it works
Computing tasks are divided and allocated according to the computing characteristics and computing power of the GPU and NPU, so that each processor performs efficient parallel computing.
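The allocation-by-computing-power idea can be sketched as follows. This is a minimal illustration, not the product's actual scheduler: the device names, the `DEVICE_POWER` weights, and the `run_on_device` placeholder are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical relative throughput of each processor (higher = faster).
DEVICE_POWER = {"npu": 3.0, "gpu": 1.0}

def partition(n_items, power):
    """Split n_items across devices in proportion to their compute power."""
    total = sum(power.values())
    shares, assigned = {}, 0
    devices = list(power.items())
    for i, (dev, p) in enumerate(devices):
        if i == len(devices) - 1:
            shares[dev] = n_items - assigned  # remainder goes to the last device
        else:
            shares[dev] = int(n_items * p / total)
            assigned += shares[dev]
    return shares

def run_on_device(device, chunk):
    # Placeholder for a real kernel launch on the given device.
    return [x * x for x in chunk]

def heterogeneous_map(data):
    """Dispatch proportional chunks to each device and run them in parallel."""
    shares = partition(len(data), DEVICE_POWER)
    futures, start = [], 0
    with ThreadPoolExecutor(max_workers=len(shares)) as pool:
        for dev, count in shares.items():
            chunk = data[start:start + count]
            start += count
            futures.append(pool.submit(run_on_device, dev, chunk))
        results = []
        for f in futures:
            results.extend(f.result())
    return results
```

With the 3:1 weights above, an 8-element workload is split 6/2 between the NPU and GPU stand-ins, and both chunks execute concurrently.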
The computation graph is reconstructed according to the computing power and instruction-set characteristics of each hardware architecture (NPU/GPU), and operators are merged and split accordingly. At the instruction level, techniques such as raising the cache hit rate, issuing multiple instructions per cycle, and efficient instruction pipelining further improve computing efficiency.
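Operator merging of the kind described above can be illustrated with a toy graph pass. This is a hedged sketch, not the actual graph compiler: the `ELEMENTWISE` set, the op names, and the scalar-function representation of operators are all simplifying assumptions.

```python
# Ops whose outputs can stay in registers/cache, making them cheap to fuse.
ELEMENTWISE = {"add_bias", "relu", "scale"}

def fuse(graph):
    """Merge runs of adjacent elementwise ops in a graph given as a list of
    (op_name, fn) stages, so intermediates avoid round-trips to memory."""
    fused, run = [], []

    def flush():
        if run:
            names = "+".join(name for name, _ in run)
            fns = [fn for _, fn in run]
            def fused_fn(x, fns=fns):
                for f in fns:          # apply the whole run in one pass
                    x = f(x)
                return x
            fused.append((names, fused_fn))
            run.clear()

    for name, fn in graph:
        if name in ELEMENTWISE:
            run.append((name, fn))     # extend the current fusible run
        else:
            flush()                    # non-fusible op ends the run
            fused.append((name, fn))
    flush()
    return fused

# Example: matmul stays separate; add_bias and relu collapse into one stage.
graph = [
    ("matmul", lambda x: x * 2),
    ("add_bias", lambda x: x + 1),
    ("relu", lambda x: max(x, 0)),
]
fused = fuse(graph)
```

The three-stage graph above becomes two stages (`matmul`, then `add_bias+relu`) while computing the same result, which is the essence of the merge step; the split direction would do the inverse for operators too large for one device.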
User-perceived benefits
+ Power consumption: Over 20% reduction per token
+ Performance: 30% improvement in the decoding phase