Abstract

Video coding standards such as MPEG-x and H.26x incorporate variable block size motion estimation (VBSME), which is highly time consuming and extremely complex to implement in hardware because of the large amount of computation involved. In this paper, we discuss the basic aspects of video coding and study and compare existing architectures for VBSME. Architectures with different pixel scanning patterns give a variety of performance results for motion vector (MV) generation, showing a tradeoff between macroblocks processed per second and the resources required for computation. The aim of this paper is to design a VBSME architecture that uses resources sparingly to minimize chip area while offering an adequate frame processing rate for real time implementation. Speed of computation is improved by accessing all 16 pixels of a 4 × 4 base block in a single clock cycle using a z scanning pattern. The sum of absolute differences (SAD), a widely adopted cost function for hardware implementation, is used in the VBSME architecture together with a multiplexer based absolute difference calculator and partial summation term reduction (PSTR) based multioperand adders. Device utilization of the proposed implementation is only 22k gates, and it can process 179 HD (1920 × 1080) frames per second in the best case and 47 HD frames per second in the worst case. Because of this high throughput, the design is well suited for real time implementation.

1. Introduction

Digital video processing has been applied to a large number of consumer electronics products such as digital video recorders (DVR), personal digital assistants (PDA), digital cameras, and set top boxes. Motion estimation (ME), which plays the most important role in video compression, is applied to evaluate the movement of blocks in the current frame. It aims to remove the temporal redundancies that exist in video sequences, which results in substantial bit rate reductions. The block matching algorithm (BMA) is widely adopted for ME as it fits well with rectangular video frames as well as block based transforms and provides a reasonably effective temporal model.

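In BMA, the previous frame is taken as the reference frame and the frame being encoded is called the current frame. A macroblock (MB) of size $N \times N$ from the current frame looks for its best match in the region of the reference frame where a match is most probable, called the search region. The search region is usually taken as $[-p, p]$ in both the $x$ and $y$ directions, which results in the evaluation of $(2p+1)^2$ candidate macroblocks. The difference between the coordinates of the current macroblock in the current frame and the best match candidate macroblock in the reference frame is called the displacement vector or motion vector (MV). A popular cost function for identifying the best match in hardware implementations is the sum of absolute differences (SAD), which is described by

$$\mathrm{SAD}(m, n) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left| c(i, j) - r(i + m, j + n) \right|, \quad -p \le m, n \le p, \tag{1}$$

where $c(i, j)$ and $r(i, j)$ are pixel values of the current macroblock and of the search region in the reference frame, and $(m, n)$ is the displacement of the candidate macroblock.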
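The full-search block matching procedure described above can be sketched in software as follows. This is only an illustrative model, not the hardware design: the names `sad`, `full_search`, `cur`, `ref`, `n`, and `p` are assumptions introduced here, and the search area is assumed to be stored as an (n + 2p) × (n + 2p) array centred on the current macroblock position.

```python
def sad(cur, ref, top, left, n):
    """Sum of absolute differences between the n x n current block and the
    n x n candidate block whose top-left corner is (top, left) in ref."""
    return sum(
        abs(cur[i][j] - ref[top + i][left + j])
        for i in range(n)
        for j in range(n)
    )


def full_search(cur, ref, n, p):
    """Evaluate all (2p+1)^2 candidates and return (best_sad, (dy, dx)).

    (dy, dx) is the displacement of the best candidate, i.e. the motion vector.
    Ties are resolved in favour of the first candidate visited.
    """
    best = None
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            cost = sad(cur, ref, p + dy, p + dx, n)
            if best is None or cost < best[0]:
                best = (cost, (dy, dx))
    return best


# Small synthetic example: a 4 x 4 block inside an 8 x 8 search area (p = 2).
n, p = 4, 2
ref = [[(31 * i + 17 * j) % 251 for j in range(n + 2 * p)] for i in range(n + 2 * p)]
true_dy, true_dx = 1, -2
cur = [[ref[p + true_dy + i][p + true_dx + j] for j in range(n)] for i in range(n)]
best = full_search(cur, ref, n, p)
```

In this toy case the current block is copied from the search area at displacement (1, -2), so the search recovers that displacement with a SAD of zero.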

Existing video coding standards offer variable block size motion estimation to improve the quality of encoding. Variable block size (VBS) motion compensated prediction (MCP) provides a significant rate distortion performance gain over conventional fixed block size MCP, but it involves massive computation and adds an extra burden to any ME architecture in the form of additional hardware complexity, extra computation time, or a combination of both. In the H.264 compression standard a typical macroblock has a dimension of 16 × 16 pixels, which can be segmented down to a smallest block size of 4 × 4 (the base block), as shown in Figure 1. This division is represented as the macroblock mode in Figure 1; hence the VBSs comprise 16 × 16, 16 × 8, 8 × 16, 8 × 8, 8 × 4, 4 × 8, and 4 × 4 blocks, which results in 41 possible combinations of variable size. Because block sizes range from 64 × 64 to 4 × 4 in the recently developed HEVC standard, it has many more combinations of variable size.

To generate SAD values for all 41 possible combinations of a 16 × 16 macroblock, 256 pixels are processed for the current macroblock as well as for each candidate macroblock. There are several overlapping candidate macroblocks, depending on the size of the search area memory. Before SAD computation, reading the pixels of the macroblocks from the different memories is the most significant task. For this purpose, raster scan [4], meander scan [5], z scan [3], or spiral scan patterns are used. Based on the pixel reading mechanism, the architecture computes the absolute differences and accumulates them, and finally a comparator identifies which block size is best suited for a particular macroblock among the various candidate macroblocks. In this paper, Section 2 surveys existing VBSME architectures and their scanning patterns. An architecture based on the z pattern is presented in Section 3. Section 3.3 reports simulation and synthesis results and a comparison with existing architectures, and Section 4 concludes the paper.

2. Macroblock Scanning Pattern and VBSME Architectures

Researchers have done a great deal of work in the field of variable size block matching. VBSME with 41 possible combinations of variable size is highly time consuming and quite complex to implement in hardware because of the large amount of computation involved. In this section existing architectures for VBSME are discussed. Full search VBSME architectures [2–9] are able to perform a full motion search on macroblocks of various sizes.

A VBSME unit initially reads the current macroblock from the current frame and the candidate macroblocks from the reference frame; its processing is divided into three stages. The first stage computes the absolute difference between corresponding elements of the current macroblock data and the reference macroblock data. The second stage calculates the intermediate results needed to generate the 41 different SAD values. This data is partially stored in a buffer and also forwarded to the third stage, which generates all the SAD values used for the generation of MVs. Architectures with different scanning patterns give a variety of performance results for motion vector (MV) generation, showing a tradeoff between macroblocks processed per second and the resources required for computation. To generate SAD values for all possible combinations of macroblocks, the architectures presented in [2, 4, 6, 7] read all pixels of the 16 × 16 macroblock using the traditional raster scan pattern, as shown in Figure 2. On the other hand, the architectures presented in [5, 9] use a meander scan and the architecture presented in [3] uses a z scan pattern, as shown in Figure 3. Based on the pixel reading mechanism, the architecture computes the absolute differences and accumulates them, and finally a comparator identifies which block size is best suited for a particular macroblock among the various candidate macroblocks.

A 16 × 16 macroblock can be segmented into 16 small blocks of size 4 × 4, as indicated in Figure 4, where the small blocks are labeled b0 to b15. In the horizontal raster scan pattern of Figure 2(a), the first row of blocks b0, b1, b2, and b3 is read, while in the vertical raster scan pattern of Figure 2(b) the first column of blocks b0, b4, b8, and b12 is read. However, both types of scan, horizontal and vertical, give the same results in terms of resource utilization as well as the number of clock cycles required for reading pixels. In VBSME architectures 1, 4, or 16 pixels are read simultaneously and processed in processing elements (PEs) to generate the SAD combinations. For parallel processing of pixels, architectures use multiple PEs, which can number 4, 16, 64, or even 256. Most architectures use a 16 × 16 search range, which is extended to 32 × 32 in a few of them. The VBSME architecture presented in [2] is based on 16 PEs. The current macroblock data is arranged in a raster scan sequence and the search region data is arranged in a dual raster scan sequence. Sixteen SAD values are computed, each for a block of size 4 × 4. The stored SAD values are then reused to compute the SAD values for the other block sizes. This is done by shuffling and combining the computed subblock SAD values appropriately to derive the SAD for each of the larger block sizes. This avoids the need to compute each of these from scratch and allows up to 41 SAD values to be processed in a single processor. The architectures presented in [2–4] read a single pixel at a time and can process only one pixel of the current macroblock and candidate macroblock per PE in a single clock cycle; hence they consume 282, 271, and 262 clock cycles, respectively, to generate the 41 SAD combinations. The architecture presented in [4] uses 18 × 1 multiplexers as well as latches and eliminates the intermediate buffer required by the architecture presented in [2].
In the architecture explained in [3], a single pixel z scan is used for reading pixels from the reference and current frames. The pixel values are fed through shift registers to 16 PEs, which are arranged in a 4 × 4 array. This arrangement is replicated several times to process multiple candidate macroblocks in a given search window. By using the scanning pattern of [4] and reading 4 pixels at a time, the number of clock cycles required to generate the 41 combinations is reduced to 70, approximately 4 times fewer, as indicated in [7]. The same authors have also presented an extended version of the architecture for 16-pixel processing, in which the number of clock cycles required to generate the same 41 combinations is reduced to 20, fewer by a factor of about 16. The architecture proposed in [5] handles 16 pixels in each clock cycle with 16 computing units. Each computing unit has 16 PEs; thus a total of 256 PEs are used to generate the SAD values for a 16 × 16 macroblock. It uses a meander-like scan pattern for the search area. After surveying these architectures and their scanning patterns, we can summarize that at least 20 clock cycles are needed to compute the 41 SAD combinations.

3. Proposed Architecture

3.1. Pixel Reading Pattern

In this section a VBSME architecture is presented with the aim of generating the 41 SAD combinations of a variable size macroblock in as few clock cycles as possible with reduced resource utilization. Instead of the conventional raster scan pattern, the proposed architecture uses a z scan pattern to read 16 pixels at a time from memory, as shown in Figure 4. With this pattern the smallest block, of size 4 × 4, can be read in one access. Once a base block is available, its SAD is computed in the very next cycle. Hence in two clock cycles blocks b0 and b1 are available and the first 4 × 8 combination can be computed. This scanning pattern eliminates the need to store the pixel values of intermediate rows or columns.
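One way to picture the block-level z scan is as a Morton (bit-interleaved) ordering of the sixteen 4 × 4 base blocks. The sketch below makes that assumption explicitly; the exact visiting order used by the architecture is defined by its figures and schedule table, which are not reproduced here. The function names are illustrative, and the labels b0 to b15 are assumed to be numbered in raster order, as in the description of Figure 2.

```python
def z_scan_block_order(blocks_per_side=4):
    """Return raster-numbered block labels (b0..b15) in Z (Morton) order.

    Assumes blocks_per_side is a power of two.
    """
    order = []
    for z in range(blocks_per_side * blocks_per_side):
        row = col = 0
        for bit in range(blocks_per_side.bit_length()):
            # Even bits of z select the column, odd bits select the row.
            col |= ((z >> (2 * bit)) & 1) << bit
            row |= ((z >> (2 * bit + 1)) & 1) << bit
        order.append(row * blocks_per_side + col)
    return order


def block_pixels(mb, b, blocks_per_side=4, size=4):
    """Return the 16 pixels of base block b from a 16 x 16 macroblock mb,
    i.e. the data assumed to be fetched in a single access."""
    top = (b // blocks_per_side) * size
    left = (b % blocks_per_side) * size
    return [mb[top + i][left + j] for i in range(size) for j in range(size)]


order = z_scan_block_order()
# order == [0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15]
```

Under this ordering the first two accesses deliver b0 and b1, which matches the statement that the first two-block combination is available after two clock cycles.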

3.2. Architecture Description

Figure 5 shows the multiple processing elements (PEs) of the proposed VBSME architecture. Each PE computes the 41 SAD combinations of the current macroblock and the corresponding candidate macroblock from reference memory, called a reference memory block (RMB). For a search window of $[-p, p]$ there are $(2p+1)^2$ candidate RMBs that need to be processed. With $p = 8$, the architecture calculates the SAD of the current macroblock against $2p+1 = 17$ RMBs together, and by repeating the process $2p+1 = 17$ times the SAD values for all 289 candidate macroblocks become available. Figure 6 shows the location of the RMBs for the various processing units and Table 1 shows the data scheduling for the proposed architecture with 17 PEs.
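The way candidate RMBs are spread over PEs and passes can be sketched as a simple enumeration. This is a hedged model rather than the schedule of Table 1: it assumes that PE k handles the candidate shifted k columns from the left edge of the search area and that each pass moves all PEs down by one row. The function name and tuple layout are illustrative.

```python
def rmb_schedule(p=8):
    """Yield (pass_index, pe_index, (dy, dx)) for every candidate RMB.

    With p = 8 there are 17 PEs working in parallel and 17 passes,
    covering (2p + 1)^2 = 289 candidates in total.
    """
    width = 2 * p + 1
    for row in range(width):          # one pass per row of candidates
        for pe in range(width):       # PEs differ by a one-column shift
            yield row, pe, (row - p, pe - p)


schedule = list(rmb_schedule(p=8))
num_candidates = len(schedule)                      # 289
num_pes = len({pe for _, pe, _ in schedule})        # 17
num_passes = len({row for row, _, _ in schedule})   # 17
```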

As shown in Table 1, in the first cycle submacroblock b0 is read from both the reference and current memories and fed to processing element PE0. At the same time all other PEs receive the same submacroblock from current memory but a submacroblock shifted by one column from reference memory. With the proposed scanning pattern sixteen pixels are scanned together and their SAD value is available in the next clock cycle. A buffer is needed to store the SAD value of this smallest, 4 × 4, submacroblock.

The processing element used in Figure 5 is shown in detail in Figure 7. The architecture is divided into multiple stages, namely, absolute difference calculation (ADC), addition of absolute differences, and generation of the 41 SAD combinations. The multiplexer based ADC presented in [10] is used to compute the absolute differences, and the adder presented in [11] is used for the addition of operands. Sixteen reference macroblock pixels and 16 current macroblock pixels are fed to the ADC unit and the result is forwarded to the adder block. The adder block sums all the difference values and stores the sum in the respective intermediate buffer, labelled b0 to b15. A 1 × 16 demultiplexer selects the respective buffer, and the 4 × 8, 8 × 4, 8 × 8, 8 × 16, 16 × 8, and 16 × 16 combinations are then computed using multilevel addition. Sums for block sizes smaller than 16 × 16 are kept on their respective data buses for further computation, and finally the 41 combinations for VBSME are ready.

At the end of 16 clock cycles, following the schedule of Table 1, all 4 × 4 submacroblocks have been read and their individual SAD values are available, as shown in Table 2. On the very next clock, that is, the 17th, the remaining 25 combinations are computed. Thus all 41 SAD values are available in a total of 17 clock cycles in all PEs. The RMBs are then immediately shifted to the next row and computation for the $2p+1$ (here 17) candidates of that row starts.
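The reuse of the sixteen 4 × 4 SAD values to obtain the remaining 25 combinations can be sketched as below. This is a software illustration of the multilevel addition, not the adder tree itself. The function name and dictionary keys are assumptions; the keys follow the text's convention in which b0 and b1 (horizontal neighbours) form a 4 × 8 block, and the input list is assumed to be in raster label order b0 to b15.

```python
def combine_sads(s):
    """Derive all 41 SAD values from the 16 base-block SADs s[0..15]."""
    if len(s) != 16:
        raise ValueError("expected 16 base-block SAD values")
    g = [s[r * 4:(r + 1) * 4] for r in range(4)]  # 4 x 4 grid of block SADs
    out = {"4x4": list(s)}
    # Horizontal pairs (e.g. b0 + b1): 8 blocks.
    out["4x8"] = [g[r][c] + g[r][c + 1] for r in range(4) for c in (0, 2)]
    # Vertical pairs (e.g. b0 + b4): 8 blocks.
    out["8x4"] = [g[r][c] + g[r + 1][c] for r in (0, 2) for c in range(4)]
    # Quadrants in the order top-left, top-right, bottom-left, bottom-right.
    q = [
        g[r][c] + g[r][c + 1] + g[r + 1][c] + g[r + 1][c + 1]
        for r in (0, 2)
        for c in (0, 2)
    ]
    out["8x8"] = q
    out["8x16"] = [q[0] + q[1], q[2] + q[3]]  # top half, bottom half
    out["16x8"] = [q[0] + q[2], q[1] + q[3]]  # left half, right half
    out["16x16"] = [sum(q)]
    return out


sads = combine_sads(list(range(16)))   # dummy base-block SADs 0..15
total = sum(len(v) for v in sads.values())
derived = total - len(sads["4x4"])
```

Counting the entries gives 16 base values plus 8 + 8 + 4 + 2 + 2 + 1 = 25 derived values, 41 in total.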

Once all SAD values are available in the $2p+1$ PEs, comparators identify the best combination among the $2p+1$ RMBs of that row, which is stored and compared with the result for the next row of RMBs. After evaluation of all $(2p+1)^2$ RMBs, the best match macroblock is identified, followed by motion vector computation. Then the next macroblock from the current frame is evaluated. The latency between two consecutive macroblocks of the current frame depends on the time required to read the search area. With a 128-bit data bus, 16 pixels are read from the reference frame concurrently, which takes 48 clock cycles for the very first macroblock and 64 clock cycles for each remaining macroblock if a single search area memory is used. In this work three search area memories are incorporated and used in round robin fashion. When $p = 8$ is chosen, the search areas of two consecutive macroblocks overlap by 50%; hence while one memory is being filled, the shared pixels are written into the next memory as well. With this arrangement the search area memory for the next macroblock is prepared while the motion vector of the current macroblock is being computed; hence there is no latency between successive macroblocks.

3.3. Synthesis Results of Proposed VBSME Architecture

The proposed VBSME hardware architecture has been implemented and tested against various evaluation metrics. It is implemented in VHDL and synthesized for the Xilinx Spartan3 and Virtex5 FPGA families using the XC3S400 and XC5VLX50 devices, respectively. The current memory size is 16 × 16 pixels, matching the 16 × 16 macroblock size, while the reference memory size is 32 × 32 pixels for a search window parameter of 8. Table 3 shows the macrostatistics for the proposed implementation. The architecture is optimized for adder subtractors and other resources and therefore has a very low gate count of only 22k. The synthesis delay of the design is only 2.543 ns, giving a maximum frequency of 393.16 MHz. At maximum frequency it can process 179 HD (1920 × 1080) frames in one second. The post place and route delay is 9.72 ns, which is taken as the worst case delay; at the corresponding frequency of 102 MHz, 47 HD (1920 × 1080) frames can be processed per second.
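As a rough sanity check on these frame rates, the clock frequency can be divided by the number of cycles needed per frame. The sketch below assumes the 272 clock cycles per macroblock stated in the conclusion and assumes that a 1920 × 1080 frame is padded up to whole 16 × 16 macroblocks; both the function names and the padding convention are assumptions, so the result is only expected to land near the reported figures.

```python
import math


def macroblocks_per_frame(width, height, mb=16):
    """Number of mb x mb macroblocks, padding partial rows and columns."""
    return math.ceil(width / mb) * math.ceil(height / mb)


def frames_per_second(clock_hz, cycles_per_mb, width, height):
    """Estimated frame rate if every macroblock costs cycles_per_mb cycles."""
    return clock_hz / (cycles_per_mb * macroblocks_per_frame(width, height))


CYCLES_PER_MB = 272  # taken from the conclusion: 17 passes of 16 cycles

best_case = frames_per_second(393.16e6, CYCLES_PER_MB, 1920, 1080)
worst_case = frames_per_second(102e6, CYCLES_PER_MB, 1920, 1080)
```

Under these assumptions the estimates come out within a few frames per second of the reported 179 and 47; the small gap depends on how partial macroblock rows and rounding are treated.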

Table 4 compares existing VLSI implementations of VBSME with the proposed implementation. A similar comparison between existing FPGA implementations of VBSME and the proposed implementation is shown in Table 5. Most architectures are implemented with variable block sizes from 16 × 16 to 4 × 4, except the one presented in [14], which is limited to block sizes between 16 × 16 and 8 × 8. The architectures presented in [7, 16] are demonstrated for a search range of 16 × 16; therefore they can evaluate only one candidate macroblock. The rest of the architectures are tested with a search range of 32 × 32 or 33 × 33. Most VLSI implementations use 180 nm or 130 nm technology, while the FPGA implementations use the Virtex series. Implementation parameters such as search area, pixel scanning pattern, data bus width for reading pixels, and number of PEs differ between designs; hence the number of macroblocks processed per second and the frame processing rate are important criteria for evaluating their performance.

The proposed architecture works on 16-pixel scanning, which results in higher throughput compared not only to 1-pixel scan and 4-pixel scan architectures but also to existing 16-pixel scan architectures. Compared with the 16-pixel raster scan architecture of Warrington et al. [7], the proposed architecture can process 3 times more HD frames even in the worst case and has a gate count about 7 times lower, while compared with the 16-pixel meander scan architecture of Wei et al. [5] it can process more than twice as many HD frames with 16 times fewer processing elements. The gate count of the architecture of López et al. [6] is comparable with that of the proposed architecture, but it offers a frame rate of only 60 fps at CIF resolution, which is quite low. The gate count of [15] is lower than that of the proposed design, but its frame processing rate is not given, so an adequate comparison is not possible. The architecture presented by Olivares [12] can process 21.42 HD (1920 × 1080) frames per second with 256 PEs; still, this frame rate is not sufficient for real time implementation. From the comparison among FPGA implementations of VBSME architectures we can also observe that the proposed design uses more LUTs but at the same time offers a higher frame processing rate. From the overall comparison with the various 16-pixel scan architectures we conclude that the proposed architecture outperforms them in terms of throughput.

For a more detailed comparison of architectures, in addition to frame processing rate, hardware efficiency [5] is used, defined as the ratio of the data throughput rate TP to the hardware cost in terms of resource utilization or gate count. TP is the number of macroblocks processed by the architecture per second. Equation (2) gives the hardware efficiency, whose unit is macroblocks per second per gate. To evaluate the efficiency of an architecture in terms of power, power efficiency is defined as the ratio of TP to power, as shown in (3); its unit is macroblocks per second per mW. The higher the hardware efficiency and the power efficiency, the more efficient the architecture.

$$\text{Hardware efficiency} = \frac{\mathrm{TP}}{\text{gate count}} \tag{2}$$

$$\text{Power efficiency} = \frac{\mathrm{TP}}{\text{power}} \tag{3}$$
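The two efficiency measures are simple ratios and can be sketched as follows. The function names are illustrative, the gate count of 22k is read as 22,000 gates, and the throughput is assumed to be the clock frequency divided by the 272 cycles per macroblock stated in the conclusion. No power figure is assumed here, so only the hardware efficiency is evaluated.

```python
def throughput_mb_per_s(clock_hz, cycles_per_mb):
    """Macroblocks processed per second (TP)."""
    return clock_hz / cycles_per_mb


def hardware_efficiency(tp, gate_count):
    """TP per gate, in macroblocks per second per gate."""
    return tp / gate_count


def power_efficiency(tp, power_mw):
    """TP per unit power, in macroblocks per second per mW."""
    return tp / power_mw


# Best-case clock from the synthesis results; 272 cycles per macroblock
# and 22,000 gates are assumptions explained in the text above.
tp_best = throughput_mb_per_s(393.16e6, 272)
hw_eff_best = hardware_efficiency(tp_best, 22_000)
```

The values tabulated in the paper may be computed with different rounding or gate count conventions, so this sketch should be read as an order-of-magnitude check only.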

Hardware and power efficiency are computed as per (2) and (3) for the existing and proposed VBSME implementations and are shown in Table 6. The hardware efficiency of the proposed architecture is more than 5 times higher than that of existing architectures in the worst case and more than 19 times higher in the best case. In terms of power efficiency, the proposed implementation produces results similar to those of the implementation presented by Fatemi et al. [13]. Apart from that, the power efficiency of the proposed architecture is better than that of the other architectures in the best case. Compared with some of the architectures, the proposed design uses somewhat more gates, but its throughput is higher than that of all existing architectures. The overall comparison indicates that the proposed VBSME architecture is hardware efficient and power efficient.

4. Conclusion

In this paper, an architecture for full search variable block size motion estimation is described. The architecture performs the calculation for all 41 combinations of variable block size motion vectors, considering 289 candidate macroblocks in a search area of 32 × 32. It uses a 16-pixel z scan pattern to access the pixels of the current macroblock and 17 candidate macroblocks and can compute all 41 combinations of a 16 × 16 macroblock in only 16 clock cycles. The process is repeated 17 times using 17 processing elements; hence in 272 clock cycles all the combinations of all candidate macroblocks are available, based on which the best match and motion vector are computed. Device utilization of the proposed implementation is only 22k gates, and it can process 179 HD (1920 × 1080) frames per second in the best case and 47 HD frames per second in the worst case. Implementation results show that, owing to the 16-pixel z scanning pattern, the proposed VBSME architecture outperforms existing 1-pixel scan, 4-pixel scan, and 16-pixel scan architectures in area utilization. The VBSME architecture demonstrates up to 19 times better hardware efficiency in comparison with other VBSME implementations. The power efficiency of the proposed VBSME architecture is either better than or comparable with that of existing implementations. The architecture can be configured with more PEs to meet the needs of an extended search area. With its adequate frame processing rate, the architecture is well suited for real time implementation.

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.