iVisual: An Intelligent Visual Sensor SoC with 2790fps CMOS Image Sensor and 205GOPS/W Vision Processor
Visual sensors combined with video analysis algorithms can enhance applications in surveillance, healthcare, intelligent vehicle control, human-machine interfaces, etc. Hardware solutions exist for video analysis. Analog on-sensor processing solutions feature image sensor integration. However, the precision loss of analog signal processing prevents those solutions from realizing complex algorithms, and they lack flexibility. Vision processors [2, 3] realize high GOPS numbers by combining a
processor array for parallel operations and a decision processor for other ones. Converting from parallel data in the processor array to scalar data in the decision processor creates a throughput bottleneck. Parallel memory accesses also lead to high power consumption. Privacy is a critical issue in setting up visual sensors because of the danger of revealing video data from image sensors or processors. These issues exist with the above solutions because inputting or outputting video data is inevitable.

iVisual is characterized as follows: 1) Privacy is protected by integrating a 2790fps CMOS image sensor, a 76.8GOPS vision processor and 1Mb of storage. It is a light-in-answer-out SoC, and no video data need be revealed outside the chip. 2) The feature processor eliminates the throughput bottleneck and increases throughput by 36%. 3) The 205GOPS/W power efficiency is 5× better than previous works [2, 3] and is achieved by introducing a feature processor and a gated-clock scheme and by reducing memory accesses.

Figure 16.1.1 shows the iVisual chip with four major parts: CMOS image sensor (CIS), global processor (GP), feature processor (FP) and decision processor (DP). The GP is a parallel data-in, parallel data-out processor and controls the bitplane memory. The FP is a parallel data-in, scalar-out processor and therefore eliminates the throughput bottleneck of data conversion. The DP processes scalar-in, scalar-out operations, which are usually decision results that further control the program execution of the GP and FP. The CIS is frame-pipelined with the GP, FP and DP to increase hardware utilization. The port of the bitplane memory is shared by the CIS and GP; port collisions are handled automatically. This port sharing reduces SRAM area by 64% and die area by 16%, with an average collision probability below 0.1%. The GP, FP and DP work concurrently. For each instruction, the availability of required resources is checked, including resources in other processors. 
An instruction is executed only when all required resources are available. This simple scheme ensures minimum inter-processor communication to synchronize the three processors and increases throughput by 23% compared with tightly-coupled processors. The clocks of unused resources are turned off to reduce power.

Figure 16.1.2 shows the CIS read-out circuits. High-gain read-out circuits have been proven to have better SNR. A gain stage with four adjustable gains is provided before the ADC. For the ADC, SAR-based and ramp-based architectures are combined to achieve a better area-speed trade-off. Compared with the conventional SAR architecture, the required cycle count increases from 18 to 20 cycles per sample, while ADC area is reduced by 48.1%. The CIS has a peak frame rate of 2790fps because of the parallel read-out architecture.

Figure 16.1.3 shows the architecture and features of the vision processor. The GP execution unit is a SIMD processor array with 128 processing elements (PEs). The PE cache lies between the PE array and the bitplane memory to reduce memory accesses by 94%, saving 726mW of power. The PE cache itself consumes 134mW. Various bitplane memory access patterns and storage allocation schemes are provided to reduce the program size and increase storage density. To enhance flexibility, each PE is indexed and has its own conditional control. PE operations, conditional control and bitwidth control can be executed in a single cycle because of the high bandwidth provided by the PE cache. Multi-resolution processing is crucial in algorithms such as face detection and object tracking. Four modes of single-cycle upsample/downsample are provided. The FP eliminates the throughput bottleneck of data conversion from the processor array to the decision processor. The FP is a parallel data-in, scalar-out processor that provides single-cycle feature extraction of data from the GP. The instruction set is designed from the analysis of algorithms and the Intel OpenCV library. 
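As a rough software illustration of what a parallel data-in, scalar-out operation looks like, the sketch below models two reductions of the kind the text describes for the FP: the index of the minimum sample and the count of samples within a range, each honoring a per-sample enable flag and using a pairwise (tree) combination. The function names `fp_min_index` and `fp_count_in_range` are hypothetical, and tie-breaking toward the lower index is an assumption. This is not the chip's instruction set, and a software loop says nothing about the single-cycle timing of the hardware.

```python
from typing import List, Optional, Tuple


def fp_min_index(samples: List[int], enable: List[bool]) -> Optional[Tuple[int, int]]:
    """Return (index, value) of the smallest enabled sample, or None if none is enabled.

    Hypothetical model: samples are combined pairwise, level by level,
    mimicking a tree-structured reduction. Disabled samples are ignored.
    """
    # Each leaf is (value, index) for an enabled sample, or None if disabled.
    level = [(v, i) if e else None for i, (v, e) in enumerate(zip(samples, enable))]
    if not level:
        return None
    while len(level) > 1:
        nxt = []
        for k in range(0, len(level), 2):
            a = level[k]
            b = level[k + 1] if k + 1 < len(level) else None
            if a is None:
                nxt.append(b)
            elif b is None:
                nxt.append(a)
            else:
                # Tuple comparison: smaller value wins, ties go to the lower index
                # (the tie-break rule is an assumption, not taken from the paper).
                nxt.append(min(a, b))
        level = nxt
    winner = level[0]
    if winner is None:
        return None
    value, index = winner
    return index, value


def fp_count_in_range(samples: List[int], enable: List[bool], lo: int, hi: int) -> int:
    """Count enabled samples whose value lies in the inclusive range [lo, hi]."""
    return sum(1 for v, e in zip(samples, enable) if e and lo <= v <= hi)


# One row of column results from the processor array; index 1 is disabled.
row = [9, 1, 7, 2, 5]
mask = [True, False, True, True, True]
min_result = fp_min_index(row, mask)
bin_count = fp_count_in_range(row, mask, 1, 5)
```

Here `min_result` identifies the smallest enabled sample and `bin_count` is one histogram bin; both are scalars that a decision processor could consume directly instead of scanning the row element by element.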
For example, the index of the input sample with the minimum value can be extracted for calculating an object bounding box; the number of samples with values within a certain range can be extracted for color histograms. A tree-structured ALU architecture ensures a short timing path. Flexibility is increased by adding an enable signal to each input sample: calculations ignore disabled samples. The FP increases throughput by 36% while occupying 4% of the area and consuming 5% of total power. Therefore, both the area efficiency and the power efficiency of iVisual are greatly increased.

The DP is a 32b processor with a MIPS-like instruction set and out-of-order control on parts of instructions. The DP register file is enlarged to access the data in the GP. The DP can also control the program execution of the GP and FP.

Figure 16.1.4 compares the throughput of iVisual and the estimated throughput of the XETAL-II architecture. Two execution units exist in XETAL-II: a processor array (LPA) and a decision processor (GCP). The effectiveness of the FP is illustrated with an example that calculates the minimum value of a frame. The processor array first calculates the minimum value of each column in parallel. In XETAL-II, the GCP then has to process the data from the LPA column by column. This is the throughput bottleneck. With iVisual, however, the FP can extract the minimum value in a single cycle. The table in Fig. 16.1.4 summarizes the comparison results.

The table in Fig. 16.1.5 shows the measured throughput of commonly used operations for video analysis at 128×128 resolution. For different video resolutions, the GP can reconfigure the storage allocation to process multiple or fractional rows per cycle. High throughput is achieved because the FP eliminates the throughput bottleneck and the synchronization scheme maximizes utilization. To show the capability of iVisual when handling complex algorithms, a posture analysis algorithm is also illustrated with a flowchart in Fig. 16.1.5. Figure 16.1.6 and Fig. 
16.1.7 show measured chip features and the die photo, respectively. iVisual is implemented on a 7.5×9.4mm² die in a UMC 0.18μm 2P4M CIS process. Throughput is increased by 36% with the introduction of the FP and is further increased by 23% through use of the synchronization scheme. A power efficiency of 205GOPS/W is achieved thanks to the FP, PE cache and gated-clock scheme. The comparisons of power efficiency and area efficiency are also illustrated.

Acknowledgements: The authors thank Peter Chang, UMC University Program and ST team for the process support. This project is funded by Himax Technologies.
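As a back-of-the-envelope check on the headline figures quoted in the text, the short calculation below derives three quantities that the text does not state explicitly. It rests on assumptions: that the 76.8GOPS throughput and the 205GOPS/W efficiency refer to the same operating point, that the 36% and 23% throughput gains compound multiplicatively (the text says "further increased"), and that the PE cache's 134mW cost can simply be subtracted from the 726mW it saves. The variable names are illustrative only.

```python
# Figures quoted in the text.
peak_throughput_gops = 76.8      # vision processor throughput
power_efficiency_gops_w = 205.0  # reported power efficiency
fp_gain = 1.36                   # throughput gain from the feature processor
sync_gain = 1.23                 # further gain from the synchronization scheme
cache_saving_mw = 726            # memory-access power saved by the PE cache
cache_cost_mw = 134              # power consumed by the PE cache itself

# Assumption: both peak figures describe the same operating point.
implied_power_w = peak_throughput_gops / power_efficiency_gops_w

# Assumption: the two throughput gains multiply.
combined_gain = fp_gain * sync_gain

# Assumption: the cache's own consumption offsets its saving one for one.
net_cache_saving_mw = cache_saving_mw - cache_cost_mw

print(f"implied power: {implied_power_w * 1000:.0f} mW")
print(f"combined throughput gain: {combined_gain:.2f}x")
print(f"net PE cache saving: {net_cache_saving_mw} mW")
```

Under these assumptions the implied power is a little under 0.4W, the combined throughput gain is roughly 1.67×, and the PE cache yields a net saving of about 0.6W.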