Software Optimization Design of MPEG-4 ASP Video Encoder

Abstract: This paper introduces the TMS320C6416 DSP and the new tools that the MPEG-4 ASP (Advanced Simple Profile) video encoder adds on top of SP (Simple Profile). It describes the software optimization methods used to implement an MPEG-4 ASP video encoder on this platform and presents experimental results. The comparison shows the advantage of the ASP encoder over the SP encoder in embedded applications: when storage capacity is limited, ASP is the more suitable choice for MPEG-4 video coding.
Keywords: MPEG-4; video encoder; software optimization

Introduction
The MPEG-4 SP (Simple Profile) encoder has received extensive attention for its compression efficiency and image quality. It has given rise to many PC-based codecs (such as DivX and Xvid) and is widely used in distance education, high-definition film, and other areas. The ASP encoder, included in version 2.0 of the MPEG-4 standard in 2001, adds several new tools to SP that further improve compression efficiency, which makes it better suited to embedded systems such as wireless video communication devices and digital video cameras.

1 Introduction to hardware platform TMS320C6416

The experimental hardware platform is the TMS320C6416 DSK (DSP Starter Kit). Its core processor is the C6416, TI's high-performance 32-bit fixed-point DSP, based on the second-generation VelociTI.2 VLIW architecture. It has 64 32-bit registers and 8 highly independent functional units (2 multipliers and 6 arithmetic logic units), runs at a clock frequency of 600 MHz, and reaches a peak processing speed of 4800 MIPS. The C6416 has 1 MB of on-chip memory and a two-level cache structure. L1P and L1D are connected directly to the CPU and run at CPU speed, while the L2 cache has 5 configuration modes so that its size can be set according to actual needs. The C6416 also has 64 independent EDMA channels, which can move large amounts of data in the background without CPU involvement. The DSK board provides 16 MB of SDRAM, which can be configured as cacheable to improve access efficiency.

2 MPEG-4 ASP video coding

The Moving Picture Experts Group (MPEG) added several new tools and profiles, including ASP, to version 2.0 of the standard released in 2001. On top of SP, ASP adds support for B-VOPs, quarter-pixel motion vectors, an optional quantizer, and global motion compensation (GMC), all of which further improve compression efficiency.
(1) B-VOP: bidirectional prediction improves the efficiency of motion compensation. Each block or macroblock can be predicted as a weighted combination of a forward prediction and a backward prediction.
(2) 1/4-pixel motion vectors: before motion estimation and compensation, the reference VOP is interpolated first at half-pixel positions and then at quarter-pixel positions. This increases the complexity of motion estimation, motion compensation, and image reconstruction, but improves coding efficiency compared with the SP encoder.
(3) Optional quantizer: ASP provides an optional inverse quantization method. With this method, the quantized coefficients $F_Q(u,v)$ are inverse quantized to produce the coefficients $F(u,v)$ as follows: $F(u,v) = 0$ if $F_Q(u,v) = 0$; otherwise $F(u,v) = \left[ (2 F_Q(u,v) + k) \times W_w(u,v) \times QP \right] / 16$, where $W_w$ is an 8 × 8 matrix of weighting factors. This method lets the encoder use $W_w$ to vary the quantization step size according to the position of the coefficient within the block.
(4) Global motion compensation (GMC): macroblocks in the same video object (VO) may undergo similar motion, for example the motion caused by camera zoom and rotation, or several macroblocks moving in the same direction. An encoder with GMC can describe this "global" motion for the entire VOP by sending only a small number of motion parameters. When a significant number of macroblocks in a VOP share the same motion characteristics, GMC can therefore improve compression efficiency considerably.
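
The inverse quantization rule in item (3) can be sketched in C as follows. This is a minimal illustration, not the encoder's actual routine. The text above does not define the offset k, so the sketch assumes the convention of the MPEG-4 Visual standard (k = 0 for intra blocks, k = sign of the quantized coefficient otherwise). The function name dequant_weighted, its parameter names, and the clamp to [-2048, 2047] are likewise assumptions, and the separate handling of the intra DC coefficient and mismatch control are omitted.

```c
#include <stdint.h>

/* Returns -1, 0 or +1 depending on the sign of x. */
static int sign_of(int x)
{
    return (x > 0) - (x < 0);
}

/* Limits v to the range [lo, hi]. */
static int clamp_int(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/*
 * Inverse quantization of one 8x8 block with a weighting matrix.
 *   fq       quantized coefficients F_Q(u,v), 64 entries
 *   f        reconstructed coefficients F(u,v), 64 entries (output)
 *   w        weighting matrix W_w(u,v), 64 entries
 *   qp       quantizer parameter QP
 *   is_intra nonzero for intra blocks (assumed: k = 0),
 *            zero for inter blocks (assumed: k = sign(F_Q))
 */
void dequant_weighted(const int16_t fq[64], int16_t f[64],
                      const uint8_t w[64], int qp, int is_intra)
{
    int i;

    for (i = 0; i < 64; i++) {
        int k, v;

        if (fq[i] == 0) {
            f[i] = 0;
            continue;
        }
        k = is_intra ? 0 : sign_of(fq[i]);
        /* C integer division truncates toward zero. */
        v = ((2 * fq[i] + k) * w[i] * qp) / 16;
        /* Assumed saturation range for 12-bit coefficients. */
        f[i] = (int16_t)clamp_int(v, -2048, 2047);
    }
}
```

With a flat weighting matrix of 16 and QP = 4, a quantized value of 3 reconstructs to 24 in an intra block and to 28 in an inter block. Larger entries of the weighting matrix at high-frequency positions enlarge the effective step size there.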

3 Software porting and optimization

A DSP differs from an ordinary PC environment. If the code is simply compiled for the DSP as it is, it runs inefficiently or does not run at all. The code must be ported, rewritten, and optimized to suit the characteristics of the DSP before it can meet real-time requirements.

3.1 Software porting
To make the code suitable for the DSP platform, first delete the many printf calls and other debugging output in the program, and use puts for the output that is still necessary, which reduces function call overhead. Declare double-word data with the double type. Delete unnecessary floating-point operations (such as the PSNR calculation), and implement the necessary ones in fixed-point arithmetic by scaling.
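
Replacing a floating-point operation by scaling can be sketched as below. This is only an illustration under assumed names: the Q15 format, the TO_Q15 macro, and the functions q15_mul and scale_sample are hypothetical and do not come from the encoder source. A fractional constant is stored as an integer scaled by 2^15, so the run-time multiplication uses integer arithmetic only.

```c
#include <stdint.h>

#define Q15_SHIFT 15  /* number of fractional bits */

/*
 * Converts a real constant to Q15 with rounding. The floating-point
 * arithmetic is meant to be folded at compile time when x is a
 * literal constant.
 */
#define TO_Q15(x) \
    ((int32_t)((x) * (1 << Q15_SHIFT) + ((x) >= 0 ? 0.5 : -0.5)))

/*
 * Multiplies an integer by a Q15 factor and rounds to the nearest
 * integer. A 64-bit intermediate avoids overflow. For negative
 * products the result relies on an arithmetic right shift, which is
 * implementation-defined in C.
 */
static int32_t q15_mul(int32_t a, int32_t b_q15)
{
    int64_t prod = (int64_t)a * b_q15;
    return (int32_t)((prod + (1 << (Q15_SHIFT - 1))) >> Q15_SHIFT);
}

/* Scales an integer sample by a fractional gain given in Q15. */
int32_t scale_sample(int32_t sample, int32_t gain_q15)
{
    return q15_mul(sample, gain_q15);
}
```

For example, scale_sample(200, TO_Q15(0.5)) yields 100 without any floating-point instruction at run time.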

3.2 Memory optimization
The C6416 DSP has 1 MB of on-chip memory that can be accessed at the full CPU clock frequency. The DSK provides 16 MB of SDRAM, which is accessed through EMIFA at 100 MHz. Because of this speed gap, CPU accesses to external memory stall the pipeline for several cycles, so making good use of the C6416's on-chip memory and L2 cache structure is critical. The 1 MB of on-chip memory is divided into 256 KB of L2 cache and 768 KB of L2 SRAM. Code segments, global data, and similar items are placed in the on-chip L2 SRAM, and the external SDRAM is set to be cacheable to improve access efficiency. These settings can be made by calling CSL (Chip Support Library) functions:

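#include <csl.h>
#include <csl_cache.h>

CSL_init();                              /* initialize the Chip Support Library */
CACHE_enableCaching(CACHE_EMIFA_CE00);   /* make the SDRAM on EMIFA CE0 cacheable */
CACHE_setL2Mode(CACHE_256KCACHE);        /* 256 KB L2 cache, 768 KB L2 SRAM */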

3.3 Project-level optimization
TI's integrated development environment CCS provides a set of compiler optimization options that can be selected according to the performance requirements of the code. Various options (-mw, -pm, -o3, -mt, etc.) can therefore be combined and tuned, which can be done with the PBC (Profile-Based Compiler) option of CCS 2.20. In addition, arranging the link order of the code sections appropriately during linking reduces the cache misses caused by function calls at run time and improves execution efficiency.

3.4 Code optimization
Code optimization is an important part of developing the MPEG-4 ASP video encoder software. Unoptimized code is very inefficient on the DSK platform: encoding a single frame takes about 25 seconds, whereas real-time operation requires at least 25 frames per second.

(1) Using TI library functions
TI provides the image processing function library IMGLIB, whose functions can be called to perform the FDCT and IDCT transforms.

(2) Rewriting the C code
First, the loops in the program are split and unrolled. For loops that cannot be unrolled, the inner and outer loop levels are arranged so that software pipelining is as efficient as possible. The C6000 compiler also provides a number of intrinsics that map directly to the corresponding assembly instructions and improve program efficiency. In addition, pragma directives can give the compiler prior knowledge that helps it generate better code. For example, #pragma MUST_ITERATE(minimum, maximum, multiple) tells the compiler the minimum and maximum trip counts of a loop and a factor that the trip count is a multiple of, which makes it easier for the compiler to apply techniques such as data packing. Taking as an example the dev16 function, which calculates the deviation of the pixels in a macroblock from their mean, rewriting it with the methods above reduces the execution time from 277 cycles to 130 cycles (both with -o3), a performance improvement of more than 50%.
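
The kind of rewrite described for dev16 can be sketched as follows. This is a hedged illustration rather than the authors' optimized function: the signature is an assumption modeled on common MPEG-4 encoder code, the inner loops are unrolled by hand four pixels at a time, and the MUST_ITERATE pragma is guarded by the TI compiler's predefined _TMS320C6X macro so that the code also builds on other compilers. C64x intrinsics are deliberately left out so that the example stays portable.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Sum of absolute deviations from the mean over a 16x16 macroblock.
 *   cur     pointer to the top-left pixel of the macroblock
 *   stride  distance in bytes between two consecutive rows
 */
uint32_t dev16(const uint8_t *cur, uint32_t stride)
{
    const uint8_t *p = cur;
    uint32_t sum = 0;
    uint32_t dev = 0;
    int mean;
    int i, j;

    /* Pass 1: accumulate all 256 pixels, four per inner iteration. */
#ifdef _TMS320C6X
#pragma MUST_ITERATE(16, 16, 16)
#endif
    for (j = 0; j < 16; j++) {
        for (i = 0; i < 16; i += 4)
            sum += p[i] + p[i + 1] + p[i + 2] + p[i + 3];
        p += stride;
    }
    mean = (int)(sum / 256);

    /* Pass 2: accumulate |pixel - mean|, again four per iteration. */
    p = cur;
#ifdef _TMS320C6X
#pragma MUST_ITERATE(16, 16, 16)
#endif
    for (j = 0; j < 16; j++) {
        for (i = 0; i < 16; i += 4) {
            dev += abs(p[i] - mean) + abs(p[i + 1] - mean)
                 + abs(p[i + 2] - mean) + abs(p[i + 3] - mean);
        }
        p += stride;
    }
    return dev;
}
```

A block of constant value returns 0. A block whose upper half is 0 and lower half is 2 has a mean of 1 and returns 256.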

(3) Linear assembly rewriting
Linear assembly is a programming language that sits between C and assembly language and is tailored to the architecture of the C6000. Code written in it can reach more than 90% of the efficiency of hand-written assembly. In addition, the C64x series of DSPs adds a number of special instructions for image and video applications, which makes code for these applications more efficient. In the ASP video encoder, for example, the AVGU4, SHRMB, UNPKLU4, and UNPKHU4 instructions are used for half-pixel interpolation, the DOTPU4 and SUBABS4 instructions for the SAD calculation, and the SPACK2 instruction for image reconstruction. Other instructions make the code easier to write, such as LDNDW, which reads pixel values from the reference frame during motion estimation (ME) and solves the problem that data in the reference image is not double-word aligned. Table 1 compares the performance of some functions before and after rewriting (both with the -o3 optimization option). The following code rewrites the function transfer_16to8copy() in linear assembly. With the -o3 option, the linear assembly version needs only 15.8% of the instruction cycles of the C version.

        .global _transfer_16to8copy
_transfer_16to8copy: .cproc dst, src, stride
        .reg    pdst, psrc, count
        .reg    ahi:alo, bhi:blo, chi:clo
        MVK     8, count
        MV      dst, pdst
        MV      src, psrc
loop:   .trip   8, 8
        LDDW    *psrc, ahi:alo
        SPACKU4 ahi, alo, blo       ; keep the value in the range 0 - 255
        LDDW    *+psrc(8), chi:clo
        SPACKU4 chi, clo, bhi
        STDW    bhi:blo, *pdst
        ADD     pdst, stride, pdst
        ADD     psrc, 16, psrc
[count] SUB     count, 1, count
[count] B       loop
        .endproc
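
For comparison, the operation performed by the linear assembly routine can be written as a plain C reference. This is a sketch of the behavior as read from the listing (eight rows of eight 16-bit values, each saturated to 0 - 255 and stored as bytes), not the original C function. The name transfer_16to8copy_ref is hypothetical.

```c
#include <stdint.h>

/*
 * Copies an 8x8 block of 16-bit values to 8-bit pixels with
 * saturation to the range 0..255.
 *   dst     destination pixels, rows are stride bytes apart
 *   src     64 contiguous 16-bit source values (8 per row)
 *   stride  distance in bytes between destination rows
 */
void transfer_16to8copy_ref(uint8_t *dst, const int16_t *src,
                            uint32_t stride)
{
    int i, j;

    for (j = 0; j < 8; j++) {
        for (i = 0; i < 8; i++) {
            int v = src[j * 8 + i];

            if (v < 0)
                v = 0;
            else if (v > 255)
                v = 255;
            dst[j * stride + i] = (uint8_t)v;
        }
    }
}
```

In the linear assembly version, each SPACKU4 instruction performs this saturation on four values at once, and each LDDW and STDW moves eight bytes, which is where the cycle savings come from.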

3.5 Data movement optimization
Because on-chip storage is limited, data such as reference images and reconstructed images can only be placed in external SDRAM, and accessing external memory causes considerable overhead. The EDMA and QDMA of the C64x need only a few clock cycles for parameter initialization and can then move data at high speed in the background without CPU involvement, which improves program execution efficiency. For simple data movement, the DAT functions provided by the CSL library can be used. Taking a simple 2D data move as an example, the implementation using QDMA is:
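DAT_open(DAT_CHAANY, DAT_PRI_LOW, DAT_OPEN_2D);   /* open any free channel for 2D transfers */
Uint32 transferID = DAT_copy2d(DAT_2D2D, con, ref, 16, 16, width);   /* 16 bytes per line, 16 lines, line pitch width; returns the transfer ID */
DAT_wait(transferID);   /* wait until the transfer has completed */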
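
To make the meaning of the 2D transfer parameters concrete, the following host-side sketch models the same data movement with a CPU loop. It is not the CSL implementation and involves no DMA. The function copy2d_2d2d and its parameter names are hypothetical, chosen to mirror the line length, line count, and line pitch arguments of the call above, on the assumption that source and destination use the same pitch.

```c
#include <stdint.h>
#include <string.h>

/*
 * CPU model of a 2D-to-2D block copy.
 *   src, dst    top-left byte of the source and destination blocks
 *   line_len    number of bytes copied per line
 *   line_cnt    number of lines
 *   line_pitch  distance in bytes between the starts of two lines,
 *               assumed identical for source and destination
 */
void copy2d_2d2d(const uint8_t *src, uint8_t *dst,
                 size_t line_len, size_t line_cnt, size_t line_pitch)
{
    size_t y;

    for (y = 0; y < line_cnt; y++)
        memcpy(dst + y * line_pitch, src + y * line_pitch, line_len);
}
```

A 16 x 16 macroblock in a frame of the given width therefore corresponds to line_len = 16, line_cnt = 16, and line_pitch = width. The advantage of the DMA version is that the CPU can continue with other work while the lines are being moved.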

For complex data movement, multi-channel EDMA can be used. EDMA provides linking and chaining mechanisms: after part of the data has been moved, the EDMA link or channel parameters are updated and reloaded automatically without CPU intervention, which is especially suitable for moving large amounts of data. Note, however, that data in the SDRAM may have a copy in the L2 cache. A cache coherence operation must therefore be performed on that data between the L2 cache and the SDRAM before the transfer, otherwise the result will be incorrect.

4 Experimental results and analysis
The MPEG-4 video encoder was run on the C6416 DSK using the software optimization methods described above. To obtain coding information such as the peak signal-to-noise ratio (PSNR), the calc_psnr() function was temporarily added back to the code so that the performance of the ASP encoder and the SP encoder could be compared. Taking the CIF-format (352 × 288) foreman video sequence as an example, at a coding rate of 256 kbit/s the SP encoder was compared with ASP encoders supporting GMC, QPEL, and B-VOP individually, and with an ASP encoder supporting all three tools (the SP frame pattern is "IPPPP..."; with B-VOP enabled, the ASP frame pattern is "IBBPBBPBBP...").
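
The implementation of calc_psnr() is not shown in the source. The standard definition of PSNR for 8-bit samples, which such a function presumably evaluates, is given below. The symbols o and r (original and reconstructed frames) and W and H (frame width and height, 352 and 288 for CIF) are notation introduced here for illustration.

```latex
\mathrm{MSE} = \frac{1}{W H} \sum_{x=0}^{W-1} \sum_{y=0}^{H-1}
               \bigl( o(x,y) - r(x,y) \bigr)^2 ,
\qquad
\mathrm{PSNR} = 10 \log_{10} \frac{255^2}{\mathrm{MSE}} \ \text{dB}
```

As a reference point, an MSE of 1 corresponds to a PSNR of about 48.13 dB, and every tenfold increase in MSE lowers the PSNR by 10 dB.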

Table 2 shows the sizes of the resulting encoded files. The ASP encoder needs less storage space than the SP encoder while the image quality changes little, so it is better suited to embedded applications such as digital cameras.

Figure 1 compares the ASP encoder (supporting B-VOP, GMC, and QPEL) with the SP encoder. The PSNR curve of the ASP encoder is flatter than that of the SP encoder and its mean square deviation is smaller, so the image quality is more stable.

Figure 1 Comparison of PSNR performance of foreman sequence ASP and SP video encoders

Although the compression efficiency improves, the amount of computation also increases. Because B-VOPs are used, backward prediction is added, which increases the coding delay and lowers the frame rate.

5 Conclusion

The ASP video encoder has higher compression efficiency. Although its encoding speed is lower and its delay is higher, it can still encode in real time on the DSP, so it is suitable for applications with limited storage capacity, such as digital cameras and video surveillance networks.
