Abstract

This paper proposes and evaluates the LFrWF, a novel lifting-based architecture to compute the discrete wavelet transform (DWT) of images using the fractional wavelet filter (FrWF). In order to reduce the memory requirement of the proposed architecture, only one image line is read into a buffer at a time. Aside from an LFrWF version with multipliers, we develop a multiplier-less LFrWF version, which reduces the critical path delay (CPD) to the delay of an adder. The proposed LFrWF architectures with and without multipliers are compared in terms of the required adders, multipliers, memory, and critical path delay with state-of-the-art DWT architectures. Moreover, the proposed architectures, along with the state-of-the-art FrWF architectures (with and without multipliers), are compared through implementation on the same FPGA board. The LFrWF with multipliers requires 22% fewer look-up tables (LUTs), 34% fewer flip-flops (FFs), and 50% fewer compute cycles (CCs) and consumes 65% less energy than the FrWF with multipliers. Also, the multiplier-less LFrWF requires 50% fewer CCs and consumes 43% less energy than the multiplier-less FrWF. Thus, the proposed LFrWF architectures appear suitable for computing the DWT of images on wearable sensors.

1. Introduction

1.1. Motivation

The availability of low-cost small-sized cameras attached to wearable sensors and portable imaging devices has opened up a wide range of imaging-oriented applications, including assisted living, smart healthcare, traffic monitoring, virtual sports experiences, and posture recognition [1–12]. An interconnection of visual sensor nodes (sensor nodes with an attached camera) is known as a visual sensor network (VSN) [13, 14] or as a wireless multimedia sensor network (WMSN) [15, 16]. Wearable visual sensors may also be a part of the Internet of things (IoT) [17–21]. Low-cost IoT wearable sensors [22] enable a wide range of activities for the benefit of society, e.g., hazard avoidance systems for worker safety [23], navigation aids for visually impaired individuals [24], activity monitoring [25], smart irrigation [26], and sports [27].

In many visual applications of wearable sensors and portable imaging devices, images captured by the camera need to be transmitted wirelessly to a body-worn or nearby hub device. The wearable sensors and portable imaging devices have limited resources, and the wireless links have narrow bandwidth [28], making it impractical to directly send the raw (uncompressed) images. Thus, an image coder is needed to compress the images before transmission [29]. In an image coder, an image is generally first transformed using the discrete cosine transform (DCT) [30] or the discrete wavelet transform (DWT) [31, 32], and then it is quantized and entropy coded. The DWT, which is also used in JPEG 2000 [33], is popular in a wide variety of applications, including activity monitoring [34], fault detection in inverter circuits [35], medical imaging [36], image denoising [37], image recognition [38], image reconstruction [39], watermarking [40], computer graphics, and real-time processing [41], due to its multiresolution feature and excellent energy compaction properties [42, 43].

The hardware architectures for wearable visual sensors and portable imaging devices in the IoT and wireless multimedia sensor networks should require minimal hardware resources and consume low energy for a small form factor and long battery life [44, 45]. Generally, the computational capabilities of visual sensor nodes have been increasing in recent years [46]. Nevertheless, due to the economic pressures on visual sensor designs and despite the emergence of specialized hardware acceleration, e.g., FPGA and, components [47–49], the computational resources of visual sensors will likely remain scarce. Emerging computing and communication paradigms, such as mobile ad hoc cloud computing [50, 51], expect the nodes to not only transmit sensed images but also to participate in some service computing functions, e.g., for localized image analysis and decision making, which can be orchestrated through software-defined networking and control structures [52–54]. In order to make the economical functioning of wearable visual sensors in such networked systems feasible, the resource usage of the image coding and transform must be very low. In particular, as the DWT is an important component of an image coder for visual sensors, the DWT hardware architecture should have minimal area and energy consumption.

1.2. Related Work

The conventional convolution-based DWT computation of an image requires a huge amount of memory due to its row- and column-wise scanning [55, 56], making it unsuitable for memory-constrained wearable sensors. The different low-memory architectures reported in the literature for computation of DWT can be categorized as line-based architectures [57], stripe-based architectures [58, 59], block-based architectures [60, 61], and the fractional wavelet filter (FrWF) architecture [62]. For an image of dimension pixels, the line, stripe, and block-based architectures require random access memory (which we refer to as RAM or memory for brevity) in the range of to words, while the FrWF architecture requires words of RAM [62].

Another low-memory pipeline-based architecture has been proposed in [63]. However, the design in [63] is based on the nonseparable DWT computation approach, which is unpopular because of its higher computational requirements than the conventional separable approach. It is a well-known fact that at a particular throughput, the separable 2D DWT computation approach is computationally more efficient than the nonseparable approach [64]. A dual data scanning-based DWT architecture is reported in [65]. In this architecture, several 2D DWT units are combined into a parallel multilevel architecture for computing up to six DWT levels. However, this architecture needs words of memory. An architecture based on an interlaced read scan algorithm (IRSA) is proposed in [66] in conjunction with a lifting-based approach with a 5/3 filter-bank which requires words of memory. However, the long critical path delay (CPD) of (where is the multiplier delay and is the adder delay) of the architecture in [66] may limit its use in real-time applications.

An LUT-based lifting architecture for computing the DWT has been reported in [67]. The design in [67] has low area and power requirements. However, it has a long CPD equal to (where is the look-up table (LUT) delay, bits is the word length, and is the full adder delay). A lifting-based architecture for computing both the 1D and 2D DWT has been presented in [68]. However, this design uses a transpose buffer of size . An energy-efficient block-based DWT architecture has been proposed in [61]. However, this architecture requires a large number of multipliers, namely, 16 and 36 multipliers for 5/3 and 9/7 filters, respectively. Another energy-efficient lifting-based reconfigurable DWT architecture has been proposed in [69], mainly for medical applications. However, the frequency of operation of this architecture is limited to 20 MHz. An energy-efficient lifting-based configurable DWT architecture for neural sensing applications has been proposed in [70], requiring 12 adders and 12 multipliers. However, its operating frequency is limited to only 400 kHz and 80 kHz for the gating and interleaving architectures used in the main architecture, respectively.

A power-efficient modified form of the DWT architecture has been presented in [71], using radix-8 Booth multipliers. This architecture uses bit truncation to reduce the area and power. However, bit truncation degrades the quality of the reconstructed image when the inverse DWT is applied. There have been some DWT implementations on graphics processing units (GPUs) [72–78]; however, GPUs are relatively expensive for low-cost sensing platforms.

The recently proposed FrWF architecture requires only words of memory and has a CPD equal to the delay of a multiplier [62]. A multiplier-less FrWF architecture, which reduces the CPD to the delay of an adder, was also reported in [62]. However, the FrWF architecture (with and without multipliers) has high energy consumption owing to its large number of compute cycles. The high energy consumption of the FrWF architecture may be prohibitive for wearable sensors and portable imaging devices with tight memory and energy constraints [79].

1.3. Contributions and Structure of This Article

This paper proposes the LFr, a novel lifting-based energy-efficient architecture to compute the DWT coefficients of an image with a 5/3 filter-bank. At the core of the proposed LFr architecture is a novel basic Lift_block that computes the and subband coefficients with only two two-input adders and one multiplier (plus two pipeline registers), thus greatly reducing the hardware requirements compared to prior convolution architectures. Moreover, a multiplier-less implementation of the proposed architecture, denoted by LFr, is designed. The multiplier-less LFr has a shorter CPD than the proposed multiplier-based LFr architecture. The proposed LFr and LFr architectures are not only efficient in terms of energy but also require fewer adders, multipliers, and registers than the state-of-the-art FrWF architectures (with multipliers (Fr) and without multipliers (Fr)). We compare the proposed architectures with state-of-the-art DWT computation architectures in terms of the required adders, multipliers, memory, and critical path delay. We also implement the proposed architectures and the state-of-the-art FrWF architectures on the same FPGA board. Experimental results demonstrate that the proposed LFr and LFr architectures have lower hardware resource requirements and energy consumption than the state-of-the-art Fr and Fr architectures.

The remaining part of the paper is arranged as follows. Section 2 gives a brief overview of the DWT and FrWF techniques. The proposed lifting-based LFrWF architecture is described in detail in Section 3 along with its memory requirement. The evaluation results along with related discussions are presented in Section 4. Finally, Section 5 concludes the paper.

2. Background

This section briefly reviews the DWT and FrWF techniques along with FrWF architecture. The main notations used in this article are summarized in Table 1.

2.1. Discrete Wavelet Transform (DWT)

The most popular approach for computing the two-dimensional (2D) DWT of an image is the separable approach, in which the rows are filtered first, followed by column-wise filtering of the resulting coefficients. When a row is convolved (filtered) by a low-pass filter (LPF) and a high-pass filter (HPF), followed by downsampling by a factor of two, the results are known as approximation and detail coefficients, respectively. For a 1D signal of dimension , which we consider as a preliminary step for computing the 2D DWT, there are approximation coefficients and detail coefficients. Combining the downsampling with the convolution operation, the approximation coefficients and the detail coefficients for can be expressed mathematically as [55]respectively, whereby and denote the LPF and HPF coefficient, respectively, denotes the signal sample, while and are the number of LPF and HPF coefficients, respectively. The largest integer less than or equal to is denoted by the symbol .
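The combined convolution and downsampling described above can be illustrated with a minimal Python sketch. It assumes the standard LeGall 5/3 analysis coefficients (which agree with the shift-and-add values discussed in Section 3.3), low-pass taps centred on even samples, high-pass taps centred on odd samples, and whole-sample symmetric boundary extension. The function names `reflect` and `dwt1d_conv` are hypothetical and are not taken from [55].

```python
# Assumed LeGall 5/3 analysis filter coefficients.
LPF = [-1/8, 2/8, 6/8, 2/8, -1/8]
HPF = [-1/2, 1.0, -1/2]

def reflect(i, n):
    """Whole-sample symmetric extension (an assumed boundary rule)."""
    if i < 0:
        return -i
    if i >= n:
        return 2 * (n - 1) - i
    return i

def dwt1d_conv(x):
    """Return (approximation, detail) lists for an even-length signal x."""
    n = len(x)
    approx, detail = [], []
    for i in range(n // 2):
        # Low-pass filtering centred on the even sample 2i; evaluating only
        # every second position realizes the downsampling by two.
        a = sum(LPF[k] * x[reflect(2 * i + k - 2, n)] for k in range(5))
        # High-pass filtering centred on the odd sample 2i + 1.
        d = sum(HPF[k] * x[reflect(2 * i + k, n)] for k in range(3))
        approx.append(a)
        detail.append(d)
    return approx, detail
```

For a signal of even length, the sketch yields half as many approximation coefficients and half as many detail coefficients as there are input samples; a constant signal passes unchanged into the approximation coefficients while all detail coefficients vanish.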

In the separable approach, all image rows are first convolved separately by an HPF and an LPF, followed by downsampling with a factor of two, resulting in the H and L subbands. Then, the columns of the H and L subbands are convolved by an HPF and an LPF, followed by downsampling with a factor of two, resulting in the LL, LH, HL, and HH subbands [80]. However, this approach needs to save the entire image in the RAM on the sensor (board) system. Thus, this DWT computation approach requires a huge amount of memory, making it unsuitable for low-cost wearable sensors and portable imaging devices with limited RAM [55, 56].

The lifting scheme [81] computes the DWT of images using in-place computations, which save memory. Moreover, the lifting scheme uses predict and update steps for computing the subbands. In particular, the low-pass filtered coefficients are predicted using the high-pass filtered coefficients. Thus, the lifting scheme reduces the convolution operations needed by the LPF coefficients. Hence, the lifting scheme reduces the number of arithmetic computations required for computing the image DWT [82].

The lifting scheme for a 5/3 filter-bank is shown in Figure 1. In this figure, are the input signal samples. Among these samples, , and are the even-indexed samples, while , and are the odd-indexed samples. Also, and are the high-frequency and low-frequency lifting parameters, respectively; and are the scaling parameters, whereby , , and [66]; , , and are the high-frequency wavelet coefficients; while , , , and are the low-frequency wavelet coefficients. The high- and low-frequency wavelet coefficients are computed following the diagram in Figure 1; for instance,

It should be noted that the arrows without an associated symbol in Figure 1 have the unit multiplication factor, i.e., 1.
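The predict and update steps of the 5/3 lifting scheme can be sketched in Python as follows. The sketch assumes the commonly used 5/3 lifting parameters of -1/2 (predict) and 1/4 (update), unit scaling, and symmetric extension at the line boundaries; the names `ALPHA`, `BETA`, and `lifting_53` are hypothetical and the sketch is not meant to reproduce Figure 1 exactly.

```python
ALPHA = -0.5  # assumed high-frequency (predict) lifting parameter
BETA = 0.25   # assumed low-frequency (update) lifting parameter

def lifting_53(x):
    """Return (low, high) coefficient lists for an even-length signal x."""
    half = len(x) // 2
    even = x[0::2]
    odd = x[1::2]
    high = []
    for i in range(half):
        # Mirror the last even sample at the right boundary.
        right = even[i + 1] if i + 1 < half else even[i]
        high.append(odd[i] + ALPHA * (even[i] + right))
    low = []
    for i in range(half):
        # Mirror the first high-frequency coefficient at the left boundary.
        left = high[i - 1] if i > 0 else high[0]
        low.append(even[i] + BETA * (left + high[i]))
    return low, high
```

Each high-frequency coefficient needs one addition of two even neighbours, one multiplication by the lifting parameter, and one addition with the odd sample; each low-frequency coefficient reuses two already computed high-frequency coefficients in the same way, which is why lifting needs fewer arithmetic operations than direct five-tap low-pass filtering.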

2.2. Fractional Wavelet Filter (FrWF)

The FrWF is a low-memory DWT computation technique [56]. It uses a specific image data scanning technique in order to reduce the memory required for computing the DWT. It selects a vertical filter area (VFA), scanning rows of the image from an SD-card (where is the number of LPF coefficients). The rows in a VFA are read in raster scan order. Once the reading of all the image rows in a VFA is complete, the VFA is shifted by two lines in the vertical direction. This shifting of the VFA is done in order to incorporate the dyadic downsampling. One line of the , and subbands is computed from one VFA. All the image lines are covered by shifting the VFA. The VFA will be shifted times for an image of dimension . The FrWF has been combined with a low-memory image coding algorithm to design an efficient image coder for WMSNs in [83].
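The VFA scanning order can be sketched in Python as below. The sketch assumes a five-line VFA (5/3 filter-bank), a VFA centred on every second image row, and mirroring of rows that fall outside the image; these boundary and alignment choices, as well as the names `mirror` and `vfa_positions`, are illustrative assumptions rather than details taken from [56].

```python
def mirror(r, n_rows):
    """Map a row index outside the image back inside (assumed mirroring)."""
    if r < 0:
        return -r
    if r >= n_rows:
        return 2 * (n_rows - 1) - r
    return r

def vfa_positions(n_rows, n_lpf=5):
    """List, for every VFA position, the image rows read in raster order."""
    half = n_lpf // 2
    positions = []
    # The VFA is shifted down by two lines each time (dyadic downsampling).
    for centre in range(0, n_rows, 2):
        rows = [mirror(r, n_rows)
                for r in range(centre - half, centre + half + 1)]
        positions.append(rows)
    return positions
```

Under these assumptions, an image with 8 rows yields 4 VFA positions, i.e., half as many positions as image rows, and each position contributes one line to each of the four subbands.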

An FPGA architecture for the FrWF with a 5/3 filter-bank has been proposed in [62]. This FrWF architecture, which follows the FrWF data scanning order, requires words of memory and a total of compute cycles. The large number of compute cycles results in a high energy consumption, which may be prohibitive for resource-constrained wearable visual sensors and portable imaging devices. The proposed LFrWF focuses on reducing the energy consumption for computing the DWT of images.

3. Proposed LFrWF Low Energy Architecture

This section presents the proposed LFrWF lifting-based architecture to compute the DWT of an image using the FrWF approach with a 5/3 filter-bank.

3.1. Data Scanning Order

The proposed lifting-based architecture follows the data scanning order of the FrWF algorithm [56]. It is assumed (as is common for low-memory implementations of the DWT computation) that the original image is stored on an SD-card; throughout, the SD accesses are appropriately buffered to compensate for the latencies of the SD-card accesses. Initially, a vertical filter area which spans image lines ( is the number of LPF coefficients) is marked in the SD-card. The rows of the image are read in raster scan order from the VFA, one line at a time into the RAM buffer P_store (as shown in Figure 2). After the processing of all the rows of the VFA is completed, the VFA is shifted down by two lines and the new rows are again read into buffer P_store in raster scan order. The complete image is read by repeatedly shifting the VFA downwards by two lines until all the rows are read. In the proposed architecture, one complete line is read at a time and scanned in raster order; in contrast, the FrWF architecture in [62] reads only 5 coefficients of an image line at a time.

3.2. Proposed Lifting-Based LFrWF Architecture

This subsection describes the proposed lifting-based DWT architecture in detail.

3.2.1. Top-Level Architecture

Figure 2 shows the top-level block diagram of the proposed LFrWF architecture. The LFrWF architecture works as follows. First, the input image pixels of a line are read into the register P_store. This P_store register stores the original image pixels of 8 bits each. The pixels of the image from P_store are sent to the Lift_block (as detailed in Figure 3) to compute the H and L subband coefficients using the lifting scheme. The generated H and L subband coefficients are saved in the register 1D_store. The contents of the 1D_store register are used as inputs for the Conv_block (as shown in Figure 4), which generates intermediate coefficients that are saved in the HH_store, HL_store, LH_store, and LL_store registers. These intermediate values are successively updated by the next image lines. After updating, the intermediate values in the registers HH_store, HL_store, LH_store, and LL_store give the values of the HH, HL, LH, and LL subbands, respectively. Once the final coefficients of the HH, HL, LH, and LL subbands are computed, they are transferred to and saved on an external SD-card. The functioning of the different blocks leading to the computation of the subbands is described next.

3.2.2. Lifting Block

In the lifting scheme with a 5/3 filter-bank, two previous high-pass filtered coefficients are used to predict a low-pass filtered coefficient. For the efficient implementation of the lifting scheme, we introduce a novel basic Lift_block. As illustrated in Figure 3, the basic Lift_block computes two H subband coefficients and one L subband coefficient from a group of five input pixels in three steps. The inputs (Input1, Input2, Input3, and Liftpar) and output (Out1) of the adders and multiplier to be used in Figure 3 for the different steps are shown in Table 2. The first two steps compute two coefficients of the H subband and the third step computes a coefficient of the L subband. The inputs in Table 2 are the first five pixels of an image line. The first two high-pass filtered coefficients are stored as the first two elements of the register 1D_store, and the first low-pass filtered coefficient is stored as the third element of the register 1D_store. The two high-pass filtered coefficients and the low-pass filtered coefficient are computed according to equations (3)–(5) with the high-frequency and low-frequency lifting parameters [66]. Once the five pixels are processed, the first two pixels are discarded and two new pixels are read along with the previous last three pixels. The same procedure, in equations (3)–(5), is repeated on these new pixels to compute the H and L subband coefficients.
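The three steps of the basic Lift_block and the sliding five-pixel window can be sketched in Python as follows. The lifting parameter values of -1/2 and 1/4 are the commonly used 5/3 values and are assumed here; the function names `lift_block` and `lift_line` are hypothetical, and boundary handling at the ends of the line is omitted.

```python
def lift_block(p, alpha=-0.5, beta=0.25):
    """Process five consecutive pixels p[0..4]; return (h1, h2, l1).

    Each step uses one two-input addition, one multiplication by the
    lifting parameter, and one further two-input addition.
    """
    h1 = p[1] + alpha * (p[0] + p[2])  # step 1: first high-pass coefficient
    h2 = p[3] + alpha * (p[2] + p[4])  # step 2: second high-pass coefficient
    l1 = p[2] + beta * (h1 + h2)       # step 3: low-pass coefficient
    return h1, h2, l1

def lift_line(line):
    """Slide the five-pixel window over a line in steps of two pixels."""
    out = []
    for start in range(0, len(line) - 4, 2):
        # Two pixels are dropped and two new pixels enter per window shift.
        out.append(lift_block(line[start:start + 5]))
    return out
```

Because the same two adders and the same multiplier can be reused in all three steps, the sketch is consistent with the stated hardware cost of two two-input adders and one multiplier for the basic Lift_block.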

The basic Lift_block in Figure 3 requires two two-input adders and one multiplier. The functionality of this basic Lift_block essentially replaces the functionality of the convolution stage-1 block in the Fr architecture, as shown in Figure 3 in [62] and elaborated in Figures 4–7 in [62]. For an LPF length of and an HPF length of , the convolution stage-1 block in [62] requires two-input adders and multipliers for the low-pass filtering as well as two-input adders and multipliers for the high-pass filtering. Thus, for a 5/3 filter, the Fr convolution stage-1 block requires six adders as well as eight multipliers.

3.2.3. Convolution Block

In the Conv_block in Figure 4, the H subband coefficients from the 1D_store register are multiplied by a suitable HPF and LPF coefficient (as determined by a multiplexer) and then added/stored with the previous value in the registers HH_store and HL_store, respectively. Similarly, the L subband coefficient in the 1D_store register is multiplied by a suitable HPF and LPF coefficient (as determined by a multiplexer) and then added/stored with the previous value in the registers LH_store and LL_store, respectively. The values in the registers HH_store, HL_store, LH_store, and LL_store are updated to compute the coefficients of the HH, HL, LH, and LL subbands, respectively.

We note that the Conv_block in Figure 4 is essentially equivalent to the aggregation of the FrWF convolution stage-2 blocks in Figures 4–7 in [62]. The Conv_block in Figure 4 requires four two-input adders and four multipliers. On the other hand, the aggregation of the FrWF convolution stage-2 blocks in Figures 4–7 in [62] requires two two-input adders and two multipliers.
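The accumulation performed by the Conv_block over the lines of one VFA can be sketched in Python as follows. The sketch assumes the standard 5/3 coefficients, a five-line VFA whose column low-pass filter spans all five lines, and a column high-pass filter aligned with the last three lines (zero weight for the first two); this alignment and the names `LPF_W`, `HPF_W`, and `subband_lines` are illustrative assumptions.

```python
# Assumed per-line weights within a five-line VFA.
LPF_W = [-1/8, 2/8, 6/8, 2/8, -1/8]
HPF_W = [0.0, 0.0, -1/2, 1.0, -1/2]

def subband_lines(vfa_1d):
    """vfa_1d: five lines, each a list of (l, h) row-filtered coefficients.

    Returns one line of each subband, accumulated line by line in the
    way the HH_store, HL_store, LH_store, and LL_store registers are updated.
    """
    width = len(vfa_1d[0])
    stores = {name: [0.0] * width for name in ("LL", "LH", "HL", "HH")}
    for k, line in enumerate(vfa_1d):
        for j, (l, h) in enumerate(line):
            # Four multiplications and four additions per coefficient pair.
            stores["LL"][j] += LPF_W[k] * l
            stores["LH"][j] += HPF_W[k] * l
            stores["HL"][j] += LPF_W[k] * h
            stores["HH"][j] += HPF_W[k] * h
    return stores
```

Only the running sums are kept between image lines, so the column filtering never requires more than one line of row-filtered coefficients at a time.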

3.2.4. Pipeline Registers

The Lift_block and the Conv_block use two and four pipeline registers, respectively, to temporarily save the intermediate results after each compute cycle. Through the use of the pipeline registers, the critical path delay (CPD) of the proposed LFrWF architecture becomes the multiplier delay.

Overall, for a 5/3 filter, considering both the basic Lift_block (Figure 3) and the Conv_block (Figure 4), the proposed LFr requires six two-input adders and five multipliers compared to eight two-input adders and ten multipliers of the Fr architecture (Figures 4–7 in [62]).

The proposed LFrWF architecture stores the original image and the subbands in the SD-card. Thus, higher wavelet decomposition levels can be computed with the same architecture, whereby the LL subband coefficients are taken as input.

3.3. Proposed Multiplier-less LFr Implementation

The 5/3 filter-bank coefficients (shown in Table 3) and the 5/3 filter-bank lifting parameters involve only integer division and multiplication. Thus, the convolution with the 5/3 filter-bank can be implemented with only shift and add operations. For example, shifting a number two times to the right is equivalent to dividing it by 4. The shift and add concept, as applied to the 5/3 filter coefficients, operates as follows:
(1) The filter coefficient -1/8 can be implemented by three right shift operations, followed by a complement operation
(2) The filter coefficient 2/8 = 1/4 can be implemented by two right shift operations
(3) The filter coefficient 6/8 = 1/4 + 1/2 can be implemented by two right shift operations, followed by addition with one right shift
(4) The filter coefficient -1/2 can be implemented by one right shift operation, followed by a complement operation
(5) The filter coefficient 1 requires no shifting

With these specified shifting operations, the convolution block can be simplified and implemented using only shifters and adders. Multiplier-less computation blocks for the 5/3 LPF and HPF coefficients are given in Figures 5 and 6, respectively. One benefit of the multiplier-less implementation over the multiplier-based architecture in Section 3.2 is that the multiplier-less implementation reduces the CPD from the multiplier delay down to the adder delay.
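The shift and add operations can be sketched with Python integers as follows. The sketch interprets the complement operation as two's-complement negation and uses plain integer right shifts, which truncate low-order bits; a hardware data path may retain additional fractional bits. All function names are hypothetical.

```python
def times_neg_1_8(x):
    return -(x >> 3)            # three right shifts, then negation

def times_2_8(x):
    return x >> 2               # two right shifts

def times_6_8(x):
    return (x >> 2) + (x >> 1)  # x/4 + x/2 = 6x/8

def times_neg_1_2(x):
    return -(x >> 1)            # one right shift, then negation

def lpf_shift_add(window):
    """Multiplier-less 5/3 low-pass output for five integer samples."""
    a, b, c, d, e = window
    return (times_neg_1_8(a) + times_2_8(b) + times_6_8(c)
            + times_2_8(d) + times_neg_1_8(e))

def hpf_shift_add(window):
    """Multiplier-less 5/3 high-pass output for three integer samples."""
    a, b, c = window
    return times_neg_1_2(a) + b + times_neg_1_2(c)  # centre tap is 1
```

For inputs that are multiples of 8 the shifted results are exact; for other inputs the truncation of the shifted-out bits introduces a small rounding error, which is a property of this integer sketch and not necessarily of the proposed hardware.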

3.4. Memory Requirement

In order to compute the DWT coefficients, the proposed LFrWF architecture uses four registers (HH_store, HL_store, LH_store, and LL_store), two register arrays (P_store and 1D_store), and six pipeline registers. The register array P_store (of size words) is used to store an image line. The and subband coefficients computed by the Lift_block are saved in the register array 1D_store of 3 words. The four registers HH_store, HL_store, LH_store, and LL_store are of words each. The total memory requirement of the proposed architecture is equal to the sum of all registers, i.e.,

3.5. Line Segmentation

Equation (6) indicates that LFrWF memory requirement grows with the image dimension and thus will be significantly greater than the FrWF memory requirement of words for large images. In order to reduce the memory requirement of the proposed LFrWF architectures, each image line may be segmented, as illustrated in Figure 7, with overlapping of coefficients at both boundaries of the second to the last, but one segment (the first and last segments only require overlapping at one boundary) (Appendix E in reference [88]). In this approach, only one line segment needs to be read into the register array P_store. Thus, the memory requirement of the LFrWF with line segments is

For the 5/3 filter-bank with a VFA of lines, the memory requirement is

The other resource requirements are independent of line segmentation and remain unchanged.

The line segmentation reduces the memory requirement of the proposed LFrWF architectures so that their memory requirement can be reduced below the memory required by FrWF architectures of [62]. The FrWF architecture does not include a line segmentation provision; therefore, its memory requirement cannot be reduced further. We observe from Table 4 that the memory requirements of the proposed LFrWF architectures are greater than the FrWF memory requirements. However, by incorporating the line segmentation approach, the memory requirement of the LFrWF architectures can be reduced below that of the FrWF architectures. In case of the 5/3 filter-bank, we observe from Table 4 that the memory requirement of the FrWF architectures is , while the memory requirement of LFrWF architecture with line segments is , see equation (8). Therefore, the LFrWF memory requirement is less than the FrWF memory requirement if .

4. Results and Discussion

This section presents the implementation of the proposed LFrWF architecture and its comparison with state-of-the-art architectures. First, we compare the proposed LFrWF architecture with several state-of-the-art architectures in terms of the required numbers of adders and multipliers, as well as the critical path delay (CPD) and required memory. Next, the postimplementation results of the proposed LFrWF architectures are compared with the state-of-the-art FrWF architecture [62] by implementing both architectures on the Xilinx Artix-7 FPGA platform.

4.1. Adders, Multipliers, CPD, and Memory

Table 4 compares the numbers of required adders and multipliers, as well as the CPD and the required RAM of the proposed LFrWF architectures with state-of-the-art architectures. The numbers of adders and multipliers of the existing state-of-the-art architectures shown in Table 4 have been taken from the corresponding papers. We observe from Table 4 that the proposed multiplier-based LFr architecture requires the fewest adders (namely, only six adders, see Figures 3 and 4) among the compared architectures. While it reduces the number of required adders only by two compared to the multiplier-based Fr architecture, it reduces the number of adders to less than half of that of the other prior architectures. Among the architectures using multipliers, the proposed multiplier-based LFr architecture also requires the fewest multipliers, namely, only five multipliers, see Figures 3 and 4. Only the RMA [85] has a similarly low multiplier requirement with six multipliers (but requires approximately twice the memory compared to LFrWF). The other prior architectures require at least twice as many multipliers as the proposed architecture.

We also observe from Table 4 that the CPDs of the proposed multiplier-based LFr architecture and the multiplier-based Fr architecture [62] both equal the multiplier delay, which is less than the CPDs of the architectures in [85, 86]. We note from Table 4 that the multiplier-less LFr and Fr architectures reduce the CPD to the adder delay, which is less than the CPD of the other state-of-the-art architectures. The adder-delay CPD achieved by the proposed multiplier-less LFr architecture is half of the shortest CPD among the prior architectures, which is achieved by the Aziz architecture [87]. Note that the shifter delay is commonly larger than the adder delay; thus, the PMA architecture [85] has a longer CPD than the Aziz architecture. The benefit of the reduction in CPD is that the architectures can be operated at higher frequencies, since the maximum operating frequency equals 1/CPD. As the CPD decreases, the maximum operating frequency increases.
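The relation between CPD and maximum operating frequency can be made concrete with a short Python sketch. The helper name `max_frequency_mhz` is hypothetical; the example CPD values of 4.8 ns and 1.45 ns are the ones reported for the FPGA implementation in Section 4.2.

```python
def max_frequency_mhz(cpd_ns):
    """Maximum operating frequency in MHz for a CPD given in nanoseconds."""
    # 1 / (cpd_ns * 1e-9 s) expressed in MHz equals 1000 / cpd_ns.
    return 1000.0 / cpd_ns

f_with_multipliers = max_frequency_mhz(4.8)    # roughly 208 MHz
f_multiplier_less = max_frequency_mhz(1.45)    # roughly 690 MHz
```

The clock periods of 5.0 ns and 1.5 ns used in Section 4.2 are slightly longer than these CPD values, so the corresponding clock frequencies stay below the respective maxima.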

Table 4 furthermore indicates that the FrWF architecture has the lowest memory requirement. However, the memory requirement of the proposed LFrWF architecture is less than the memory requirement of the other state-of-the-art architectures in Table 4. As noted in Section 3.5, when each image line is segmented, the LFrWF memory requirement drops below the FrWF memory requirement once the number of segments exceeds the threshold derived there.

4.2. FPGA Implementation

The proposed LFrWF architecture computes the DWT coefficients of images based on lifting while following the FrWF approach. As observed from Table 4, the FrWF architecture [62] requires the least memory among the state-of-the-art architectures. Thus, we implemented the FrWF architectures [62] and the proposed LFrWF architectures (initially without segmentation, i.e., ) on an Artix-7 FPGA (family: Artix-7, device: xc7a15t, package: csg324, speed: ). The implementations used identical multipliers, adders, and other components provided by the Xilinx Artix-7 FPGA family. All architectures used an input pixel width of 8 bits and a data-path width of 16 bits. Table 5 summarizes the FPGA implementation comparison. We report averages for evaluations with seven popular (8 bits/pixel) test images, namely, “lena,” “barbara,” “goldhill,” “boat,” “mandrill,” “peppers,” and “zelda,” obtained from the Waterloo Repertoire (http://links.uwaterloo.ca). The energy consumption is evaluated by multiplying the number of compute cycles with the average power consumption and the compute (clock) cycle durations of 5.0 ns and 1.5 ns for the architectures with multipliers and without multipliers, respectively. These clock cycle durations have been selected to satisfy the CPD constraint, as given in Table 5, namely, a CPD of 4.8 ns for the design with multipliers and a CPD of 1.45 ns for the multiplier-less design. The number of compute cycles and the average power consumption were evaluated by simulation with the Xilinx Vivado software suite, version 2018.2.

We observe from Table 5 that the proposed multiplier-based LFr architecture requires approximately 22% fewer LUTs, 34% fewer FFs, and 50% fewer compute cycles, and consumes 65% less energy than the multiplier-based Fr architecture. Due to the reduced number of hardware components (LUTs and FFs), the area occupied by the multiplier-based LFr architecture will be less than the area of the corresponding Fr architecture. Moreover, the proposed multiplier-less LFr architecture requires 2.6% fewer FFs and 50% fewer compute cycles and consumes 43% less energy than the multiplier-less Fr architecture [62]. The proposed multiplier-less LFr architecture requires slightly more LUTs than the multiplier-less Fr architecture.

We also observe from Table 5 that the proposed LFrWF reduces the number of required compute cycles to roughly half the compute cycles required by the FrWF. More specifically, while the FrWF requires on the order of 10 million compute cycles for a image, the proposed LFrWF requires only a little more than 5 million compute cycles. This substantial reduction is primarily due to the computational efficiency of the novel Lift_block (see Section 3.2.2) for computing the decomposition subband coefficients.

Moreover, we observe from Table 5 that the power consumption of the proposed LFrWF architecture with multipliers is less than the power consumption of the corresponding FrWF architecture with multipliers, while the multiplier-less LFrWF and FrWF have approximately the same power consumption. The energy consumption is evaluated by multiplying clock cycle duration (which is based on the CPD) with the number of clock cycles and the consumed power. Due to the reduced (almost half) number of compute cycles and the lower (or same) power consumption, the energy consumption levels of the proposed LFrWF architectures are substantially lower than the energy consumption levels of the FrWF architectures. We further observe from Table 5 that compared to the designs with multipliers, the multiplier-less designs of both the LFrWF and the FrWF have the same numbers of clock cycles, but shorter CPD and (slightly) reduced power levels; thus, the multiplier-less designs have substantially reduced energy consumption levels.
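The energy evaluation described above amounts to a single product, sketched here in Python. The cycle count of 10 million is the order of magnitude stated for the FrWF, whereas the power value of 0.10 W is purely hypothetical and is not taken from Table 5; the function name `energy_joules` is likewise an assumption.

```python
def energy_joules(compute_cycles, avg_power_w, clock_period_s):
    """Energy = number of compute cycles x average power x clock period."""
    return compute_cycles * avg_power_w * clock_period_s

# Hypothetical illustration: 10 million cycles at 0.10 W with a 5.0 ns clock.
e_example = energy_joules(10_000_000, 0.10, 5.0e-9)   # about 5 mJ

# Halving the cycle count at unchanged power and clock period halves the energy.
e_half = energy_joules(5_000_000, 0.10, 5.0e-9)
```

The product form shows why halving the number of compute cycles translates directly into energy savings, and why a shorter clock period in the multiplier-less designs reduces the energy further even when the cycle count stays the same.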

We also observe from Table 5 that both architectures have the same CPD. We note that the numbers of hardware components, e.g., adders, multipliers, LUTs, and FFs, and other parameters, such as the number of clock cycles, memory, and CPD (multiplier delay or adder delay), are independent of the platform on which the design is implemented and the test image. Among the results presented in Tables 4–6, only the energy consumption, the power consumption, and the energy delay product (EDP) depend on the platform and image.

4.3. Line Segmentation

We observe from Table 6 that increasing the number of line segments reduces the memory requirement while increasing the number of compute cycles and the energy consumption. The increases in compute cycles and energy consumption are mainly due to the coefficients that overlap at the line segment boundaries and therefore need to be read twice. However, we observe from Tables 5 and 6 that for all considered numbers of segments per line, the number of compute cycles and the energy consumption of the proposed LFrWF architectures are less than those of the corresponding FrWF architectures. Since the FrWF architectures of [62] read only 5 pixels at a time, the segmentation approach cannot be incorporated into the FrWF architecture. Hence, the memory of the FrWF architectures cannot be further reduced by incorporating line segmentation.
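The memory versus read-count tradeoff of line segmentation can be sketched as follows. This is a simplified illustration and not the segmentation scheme of the proposed architecture: the line width, the overlap width, and the one-sided overlap are assumptions made only for this example.

```python
def segment_line(line_width, num_segments, overlap):
    """Split one image line into index ranges [start, end).

    Every segment after the first re-reads `overlap` samples that
    precede its nominal start (an assumed, simplified boundary rule).
    """
    if line_width % num_segments != 0:
        raise ValueError("line_width must be divisible by num_segments")
    seg_len = line_width // num_segments
    ranges = []
    for s in range(num_segments):
        start = max(0, s * seg_len - overlap)
        end = (s + 1) * seg_len
        ranges.append((start, end))
    return ranges


def buffer_size(ranges):
    """Largest segment, i.e., the line buffer needed in samples."""
    return max(end - start for start, end in ranges)


def total_reads(ranges):
    """Total samples read per line, counting overlapped samples twice."""
    return sum(end - start for start, end in ranges)


# Hypothetical example: 512-sample line, 2 overlapping samples per boundary.
whole = segment_line(512, 1, 2)
split = segment_line(512, 4, 2)
```

In this sketch, splitting the line into four segments shrinks the buffer from 512 to 130 samples, while the total number of reads per line grows from 512 to 518 because the boundary samples are read twice.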

The EDPs of the LFrWF and FrWF architectures with and without multipliers are compared in Figures 8 and 9, respectively. The EDP, which characterizes both the consumed energy and the computational performance, is evaluated by multiplying the consumed energy with the corresponding clock cycle duration. We observe from Figures 8 and 9 that the EDPs of the proposed LFrWF architectures (with and without multipliers) are less than the EDPs of the corresponding FrWF architectures (with and without multipliers). The EDP of the proposed architecture with multipliers is approximately 65% less than that of the FrWF architecture with multipliers, and the EDP of the proposed multiplier-less architecture is approximately 43% less than that of the multiplier-less FrWF architecture. We further observe from Figures 8 and 9 that the EDPs of the proposed LFrWF architectures increase with the number of segments. However, for all considered numbers of segments per image line, the EDPs of the proposed LFrWF architectures are less than those of the corresponding FrWF architectures.
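The relation between the EDP and the energy consumption can be made explicit. The symbols below are notation introduced here for illustration only: $P$ denotes the consumed power, $N_{cc}$ the number of clock cycles, and $T_{clk}$ the clock cycle duration (based on the CPD).

\[
E = P \cdot N_{cc} \cdot T_{clk}, \qquad \mathrm{EDP} = E \cdot T_{clk} = P \cdot N_{cc} \cdot T_{clk}^{2}
\]

If two architectures operate with the same clock cycle duration, as Table 5 indicates for corresponding LFrWF and FrWF designs, then the clock cycle duration cancels in the EDP ratio:

\[
\frac{\mathrm{EDP}_{\mathrm{LFrWF}}}{\mathrm{EDP}_{\mathrm{FrWF}}} = \frac{E_{\mathrm{LFrWF}} \cdot T_{clk}}{E_{\mathrm{FrWF}} \cdot T_{clk}} = \frac{E_{\mathrm{LFrWF}}}{E_{\mathrm{FrWF}}}
\]

Under this equal-CPD condition, the relative EDP reduction equals the relative energy reduction, which is consistent with the approximately 65% and 43% reductions being reported for both the energy consumption and the EDP.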

5. Conclusion

This paper proposed and evaluated a lifting-based architecture to compute the DWT coefficients of an image based on the FrWF approach with a 5/3 filter-bank. The proposed architecture requires fewer adders and multipliers than state-of-the-art architectures. The proposed LFrWF architectures with and without multipliers and the state-of-the-art FrWF architectures with and without multipliers [62] have been implemented on the same FPGA board and compared.

The experimental results show that the proposed LFrWF architecture with multipliers requires fewer hardware components (and thus less area) and consumes 65% less energy than the FrWF architecture with multipliers. Moreover, the proposed multiplier-less LFrWF architecture consumes 43% less energy with only a slight increase in area compared to the multiplier-less FrWF architecture. The lower energy consumption with minimal area overhead makes the proposed architectures promising candidates for computing the DWT of images on resource-constrained wearable sensors.

An important direction for future research is to integrate the LFrWF architecture with efficient architectures of state-of-the-art wavelet-based image coding algorithms to design FPGA-based image coders for real-time applications on wearable visual sensors and IoT platforms. Another interesting future research direction is the examination of the use of our proposed approach in the context of compressive sensing [15, 89].

Data Availability

The evaluation data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.