SIGDA, DATE 2006, Designers' Forum, Abstracts

DATE 2006 DESIGNERS' FORUM, ABSTRACTS


4D: Secure and Security Systems

Moderators: B. Kasser, STMicroelectronics, FR; G. Bertoni, STMicroelectronics, IT

Architectures for Efficient Face Authentication in Embedded Systems [p. 1]

N. Aaraj, S. Ravi, A. Raghunathan and N. K. Jha

Biometrics represent a promising approach for reliable and secure user authentication. However, they have not yet been widely adopted in embedded systems, particularly in resource-constrained devices such as cell phones and personal digital assistants (PDAs). In this paper, we investigate the challenges involved in using face-based biometrics for authenticating a user to an embedded system. To enable high authentication accuracy, we consider robust face verifiers based on principal component analysis/linear discriminant analysis (PCA-LDA) algorithms and Bayesian classifiers, and their combined use (multi-modal biometrics). Since embedded systems are severely constrained in their processing capabilities, algorithms that provide sufficient accuracy tend to be computationally expensive, leading to unacceptable authentication times. On the other hand, achieving acceptable performance often comes at the cost of degradation in the quality of results. Our work aims at developing embedded processing architectures that improve face verification speed with minimal hardware requirements, and without any compromise in verification accuracy. We analyze the computational characteristics of face verifiers when running on an embedded processor, and systematically identify opportunities for accelerating their execution. We then present a range of targeted hardware and software enhancements that include the use of fixed-point arithmetic, various code optimizations, application-specific custom instructions and co-processors, and parallel processing capabilities in multi-processor systems-on-chip (SoCs). We evaluated the proposed architectures in the context of open-source face verification algorithms running on a commercial embedded processor (Xtensa from Tensilica). Our work shows that fast, in-system verification is possible even in the context of many resource-constrained embedded systems. 
We also demonstrate that high authentication accuracy can be achieved with minimal hardware overhead, while requiring no modifications to the core face verification algorithms.
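As an illustration of the fixed-point arithmetic mentioned in the abstract above, the following is a minimal sketch of a dot product (the kernel of a PCA-style projection) computed on integers only. The Q15 format, the function names and the sample vectors are assumptions made for this example; the abstract does not state which number format or code structure the authors used.

```python
FRAC_BITS = 15  # Q15 format: an assumption for this sketch, not taken from the paper


def to_q15(x: float) -> int:
    """Quantize a float in [-1, 1) to a saturated Q15 integer."""
    v = int(round(x * (1 << FRAC_BITS)))
    return max(-(1 << FRAC_BITS), min((1 << FRAC_BITS) - 1, v))


def from_q15(v: int) -> float:
    """Convert a Q15 integer back to a float."""
    return v / (1 << FRAC_BITS)


def q15_dot(a, b) -> int:
    """Dot product of two Q15 vectors using integer operations only."""
    acc = 0
    for x, y in zip(a, b):
        acc += x * y  # each product is Q30; accumulate at full width
    return acc >> FRAC_BITS  # rescale once at the end, back to Q15


# Hypothetical feature vector and projection basis vector.
feature = [to_q15(v) for v in (0.5, -0.25, 0.125)]
basis = [to_q15(v) for v in (0.5, 0.5, 0.5)]
projection = q15_dot(feature, basis)
```

Rescaling once after accumulation, rather than after every multiplication, keeps the rounding error to a single truncation, which is one common way to preserve accuracy when floating-point code is moved to integer hardware.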

Software Implementation of Tate Pairing over GF(2^m) [p. 7]

G. Bertoni, L. Breveglieri, P. Fragneto, G. Pelosi and L. Sportiello

Recently, interest in the Tate pairing over binary fields has decreased due to the existence of efficient attacks on the discrete logarithm problem in the subgroups of such fields. We show that choosing fields large enough to make these attacks infeasible does not degrade the computational performance of the pairing. We describe and evaluate by simulation an implementation of the Tate pairing that achieves good timing results, comparable with those reported in the literature but with a higher level of security.

Optimization of Regular Expression Pattern Matching Circuits on FPGA [p. 12]

C.-H. Lin, C.-T. Huang, C.-P. Jiang and S. C. Chang

Regular expressions are widely used in Network Intrusion Detection Systems (NIDS) to represent patterns of network attacks. Since traditional software-only NIDS cannot keep up with increasing network speeds, many previous works propose hardware architectures on FPGAs to accelerate attack detection. The challenge of hardware implementation is to fit the regular expressions for the large number of attacks onto FPGAs. Although the minimization of logic equations has been studied intensively in the CAD area, the minimization of multiple regular expressions has been largely neglected. This paper presents a novel architecture allowing our algorithm to extract and share common sub-regular expressions. Experimental results show that our sharing scheme significantly reduces the area of regular expression circuits.

Satisfiability-Based Framework for Enabling Side-Channel Attacks on Cryptographic Software [p. 18]

N. R. Potlapally, A. Raghunathan, S. Ravi, N. K. Jha and R. B. Lee

Many electronic systems contain implementations of cryptographic algorithms in order to provide security. It is well known that cryptographic algorithms, irrespective of their theoretical strength, can be broken through weaknesses in their implementation. In particular, side-channel attacks, which exploit unintended information leakage from the implementation, have been established as a powerful way of attacking cryptographic systems. All side-channel attacks can be viewed as consisting of two phases: an observation phase, wherein information is gathered from the target system, and an analysis or deduction phase, in which the collected information is used to infer the cryptographic key. Thus far, most side-channel attacks have focused on extracting information that directly reveals the key, or variables from which the key can be easily deduced. We propose a new framework for performing side-channel attacks by formulating the analysis phase as a search problem that can be solved using modern Boolean analysis techniques such as satisfiability solvers. This approach can substantially enhance the scope of side-channel attacks by allowing a potentially wide range of internal variables to be exploited (not just those that are "simply" related to the key). For example, software implementations take great care in protecting secret keys through the use of on-chip key generation and storage. However, they may inadvertently expose the values of intermediate variables in their computations. We demonstrate how to perform side-channel attacks on software implementations of cryptographic algorithms based on the use of a satisfiability solver for reasoning about the secret keys from the values of the exposed variables. Our attack technique is automated, and does not require mathematical expertise on the part of the attacker. We demonstrate the merit of the proposed technique by successfully applying it to two popular cryptographic algorithms, DES and 3DES.

An 830mW, 586kbps 1024-Bit RSA Chip Design [p. 24]

C. Yeh, E.-F. Hsu, K.-W. Cheng, J.-S. Wang and N.-J. Chang

This paper presents an RSA hardware design that simultaneously achieves high performance and low power. A bit-oriented, split modular multiplication algorithm and architecture are proposed to fully exploit the radix-4 computational capability. Further, we identify the switching profile of RSA data and accordingly propose power-optimized designs for the storage elements and key computational components. The complete RSA modular exponentiation hardware has been implemented using cell-based 0.18μm CMOS technology. Post-layout simulation shows that the design delivers an average performance of 586kbps at 460MHz, 1.8V while consuming only 830mW.
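The figures quoted in the abstract above can be related to one another with some simple arithmetic. The sketch below assumes that the 586kbps throughput means one 1024-bit modular exponentiation per 1024 bits processed; the abstract does not spell this out, so the derived per-operation values are indicative only. The function and key names are chosen for this example.

```python
KEY_BITS = 1024          # operand size, from the paper title
THROUGHPUT_BPS = 586e3   # average performance, from the abstract
CLOCK_HZ = 460e6         # clock frequency, from the abstract
POWER_W = 0.830          # power consumption, from the abstract


def rsa_figures():
    """Derive per-operation figures under the one-exponentiation-per-block assumption."""
    time_per_op_s = KEY_BITS / THROUGHPUT_BPS       # seconds per 1024-bit block
    cycles_per_op = time_per_op_s * CLOCK_HZ        # clock cycles per block
    energy_per_op_j = POWER_W * time_per_op_s       # joules per block
    energy_per_bit_j = POWER_W / THROUGHPUT_BPS     # joules per processed bit
    return {
        "time_per_op_s": time_per_op_s,
        "cycles_per_op": cycles_per_op,
        "energy_per_op_j": energy_per_op_j,
        "energy_per_bit_j": energy_per_bit_j,
    }


figures = rsa_figures()
```

Under that assumption, one block takes roughly 1.75 ms (about 0.8 million clock cycles) and costs roughly 1.4 μJ per processed bit.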

Platform Independent Debug Port Controller Architecture with Security Protection for Multi-Processor System-on-Chip ICs [p. 30]

D. Akselrod, A. Ashkenazi and Y. Amon

A Debug Port Controller (DPC) architecture, designed for re-use in multiple System-on-Chip (SoC) Integrated Circuits (ICs), is presented. The DPC incorporates security protection against unauthorized access along with advanced debugging features such as long chain debugging, universal BIST engines control, and generic serial interfaces. An implemented security architecture of the DPC is presented together with an overall IC security scheme. The DPC is the most important part of this IC security scheme. The suggested architecture demonstrates extensive use of the debug process, and re-use of the DPC in multiple SoC ICs without the need to adapt its design for a specific SoC. The implementation of the DPC for the IEEE 1149.1 standard is presented and the hardware realization of the proposed architecture is described in detail. The DPC that incorporates the proposed architecture has been designed in a 90 nm CMOS process as an integral part of several SoC ICs.

5D: Reconfigurable Computing

Moderators: C. Heer, Infineon Technologies, DE; H. Blume, TU Aachen, DE

Automated Conversion from LUT-Based FPGA to a LUT-Based MPGA with Fast Turnaround Time [p. 36]

F.-J. Veredas, M. Scheppler and H.-J. Pfleiderer

Mask Programmable Gate Arrays (MPGAs) are growing in importance because of the increase in design cost and turnaround times in ultra-deep submicron technologies, which mostly impacts ASICs. Several design methodologies have been proposed in recent years for converting an evaluated Field-Programmable Gate-Array (FPGA) prototype design into an MPGA. An automatic conversion flow is essential to success. In this paper, we present a conversion flow for a Look-up Table-based (LUT-based) MPGA without applying re-synthesis but preserving the gate-level netlist and reusing the placement. The resulting flow has a special routing tool and buffer insertion algorithm for timing integrity. The experimental investigations use a commercial FPGA and industrial benchmarks.

Energy-Efficient FPGA Interconnect Design [p. 42]

M. Meijer, R. Krishnan and M. Bennebroek

Despite recent advances in FPGA devices and embedded cores, their deployment in commercial products remains rather limited due to practical constraints on, for example, cost, size, performance, and/or energy consumption. In this paper, we address the latter bottleneck and propose a novel FPGA interconnect architecture that reduces energy consumption without sacrificing performance and size. It is demonstrated that the delay of a full-swing, fully-buffered interconnect architecture can be matched by a low-swing solution that dissipates significantly less power and contains a mix of buffer and pass-gate switches. The actual energy savings depend on the specifics of the interconnect design and applications involved. For the considered fine-grain FPGA example, energy savings are observed to range from a factor of 4.7 for low-load critical nets to a factor of 2.8 for high-load critical nets. The results are obtained from circuit simulations in a 0.13 μm CMOS technology for various benchmarks.

A New Approach to Compress the Configuration Information of Programmable Devices [p. 48]

M. Martina, G. Masera, A. Molino, F. Vacca, L. Sterpone and M. Violante

During the last decade programmable devices have become impressively widespread, entering some traditional ASIC market domains. In particular, multi-million gate FPGAs have become a very appealing low-cost solution even for consumer applications. However, one of the big issues that can arise with modern FPGA devices is the need for large and expensive external non-volatile memory to keep the configuration data. In this work we developed an alternative technique to compress FPGA bitstreams based on knowledge of the device's internal structure. The proposed method uses a two-step coder: in the first step the bitstream is adaptively "filtered" to remove data redundancy, while in the second step an arithmetic coder is used to actually compress the information. The effectiveness of the proposed technique has been demonstrated on a set of case studies. As a result, conventional approaches are outperformed, reaching a compression ratio of 4.26 against 3.3.

Design and Implementation of a Rendering Algorithm in a SIMD Reconfigurable Architecture (MorphoSys) [p. 52]

J. Davila, A. de Torres, J. M. Sanchez, M. Sanchez-Elez, N. Bagherzadeh and F. Rivera

In this paper we analyze a 3D image rendering algorithm and the different mapping schemes to implement it in a SIMD reconfigurable architecture. 3D image rendering is computationally intensive and is tightly constrained in execution time by the requirement for interactive results. We demonstrate that the execution of this algorithm in MorphoSys can take advantage of the available parallel resources, as well as of the possibility of one-cycle configuration change. We show that it is possible to implement the rendering algorithm in our coarse-grain reconfigurable architecture, obtaining values over 100 fps.

Application Specific Instruction Processor Based Implementation of a GNSS Receiver on an FPGA [p. 58]

G. Kappen and T. G. Noll

In this paper the concept of a reconfigurable hardware macro to be used as a generic building block in low-power, low-cost SoC for multioperable GNSS positioning is described, featuring sufficient computational power and flexibility. The central processing unit of the reconfigurable hardware macro is an ASIP accelerated by additional eFPGA and weakly configurable ASIC-based coprocessors. The different hardware building blocks (i.e. ASIP, eFPGA, ASIC) of the target architecture are motivated with state-of-the-art GNSS receiver algorithms. To explore the design space of the target architecture and to develop appropriate partitioning cost functions, a GNSS receiver testbed was realised on an FPGA board. The testbed utilises a programmable ASIP, designed and generated with the processor description language LISA, as a central processing unit. As a first accelerating coprocessor the correlator was realised. Exemplary optimisations of the ASIP / co-processor architecture as well as the achieved improvements are described.

A Methodology for FPGA to Structured-ASIC Synthesis and Verification [p. 64]

M. Hutton, R. Yuan, J. Schleicher, G. Baeckler, S. Cheung, K. K. Chua and H. K. Phoon

Structured-ASIC design provides a mid-way point between FPGA and cell-based ASIC design for performance, area and power, but suffers from the same increasing verification burden associated with cell-based design. In this paper we address the verification issue with a methodology and fabric to directly tie FPGA prototype and functional in-system verification with a clean migration path to structured ASIC. The most important aspects of this methodology are the use of physically identical blocks for difficult-to-verify PLLs, I/O and RAM and a structured re-synthesis of FPGA logic blocks to target cells that guarantees anchor points for easy formal verification.

6D: Specification and Verification

Moderators: M. de Marinis, SensorDynamics, DE; D. Strle, Ljubljana U, SL

Synthesis of System Verilog Assertions [p. 70]

S. Das, R. Mohanty, P. Dasgupta and P. P. Chakrabarti

In recent years, Assertion-Based Verification has become widely accepted as a key technology in the pre-silicon validation of system-on-chip (SOC) designs. The System Verilog language integrates the specification of assertions with the hardware description. In this paper we show that there are several compelling reasons for synthesizing assertions in hardware, and present an approach for synthesizing System Verilog Assertions (SVA) in hardware. Our method investigates the structure of SVA properties and decomposes them into simple communicating parallel hardware units that together act as a monitor for the property. We present a tool that performs this synthesis, and also show that the chip area required by the monitors for an industry-standard ABV IP for the ARM AMBA AHB protocol is quite modest.

Generating Finite State Machines from SystemC [p. 76]

A. Habibi, H. Moinudeen and S. Tahar

SystemC is a system-level language proposed to raise the abstraction level for embedded systems design and verification. In this paper, we propose to generate Finite State Machines (FSM) from SystemC designs using two algorithms originally proposed for the generation of FSM from Abstract State Machines (ASM). This proposal enables the integration of SystemC with existing tools for test case generation from FSM. This in turn enables two important applications: (1) using the FSM graph structure to produce test suites allowing functional testing of SystemC designs; and (2) performing conformance testing, where the FSM serves as a precise model of the observable behavior of the system used to validate lower abstraction levels of the design (e.g., Register Transfer Level (RTL)).

Flexible Specification and Application of Rule-Based Transformations in an Automotive Design Flow [p. 82]

J. H. Oetjens, J. Gerlach and W. Rosenstiel

This paper addresses an XML-based design environment, which provides a powerful basis for the manipulation of hardware design descriptions. The contribution of the paper is a flexible specification entry for the definition of transformation rules, which allows designers to specify transformations on their own without having XML expertise. The specification entry provides a guided and graphically supported mechanism to define transformation rules. This opens up a new approach, in which the specification and verification of a transformation rule is carried out by using simple design examples, to be applied to arbitrarily complex designs subsequently. A new key characteristic of our approach is that both the transformation environment and the transformation entry tool are based on a very compact definition of the hardware description language grammar in use, and both of them are fully automatically generated from that basic grammar definition. This makes our approach highly open to other hardware and system specification languages. The paper describes the transformation environment and transformation entry tool, and demonstrates their application in terms of two automotive-typical transformations, addressing power aspects on the one hand, and safety aspects on the other.

A Mixed-Signal Verification Kit for Verification of Analogue-Digital Circuits [p. 88]

G. Bonfini, M. Chiavacci, R. Mariani and E. Pescari

This paper presents an innovative approach for analogue and mixed-signal verification. It consists of a "verification kit" that makes use of concepts used in state-of-the-art digital verification, such as automatic results collection, coverage elaboration, data checking capability, and pseudo-random and constrained stimuli generation. Using a Bandgap cell as a case study, the paper shows how the presented approach allows a precise definition of the verification space and a saving of more than 50% of the total verification effort with respect to traditional verification methodologies. The paper also shows how the approach can be extended to more complex mixed-signal systems.

A Complete and Fully Qualified Design Flow for Verification of Mixed-Signal SoC with Embedded Flash Memories [p. 94]

P. Daglio

Today almost everyone in the industry is talking about full-chip mixed-signal simulation, both in pre-layout and post-layout conditions, for two main reasons: a large range of applications is moving from fully digital to mixed-signal, and full-chip simulation with parasitic components, together with IR drop analysis, is becoming strictly mandatory before going to silicon. In fact, the cost of a mask set for a 90nm or a 65nm technology is growing exponentially, exceeding one million dollars for a single mask set. For these reasons, it is strategic to set up a very complete mixed-signal design flow allowing designers to go to silicon safely, with the minimum risk of failure. Nowadays, various approaches to the same problem are pursued by different organizations, sometimes favouring fully digital modeling of the mixed-signal system and at other times setting the digital part in VHDL and keeping the analog part at transistor level, simulating the whole chip with a mixed-signal simulator. Which is the right approach? What are the status and the reliability of the tools on the market? What is the acceptable trade-off among simulation speed, code coverage and precision of simulation results? This paper tries to answer these questions by proposing a fully qualified and complete mixed-signal flow for SoC verification, implemented to design applications that also contain embedded flash memories.

Software-Friendly HW/SW Co-Simulation: An Industrial Case Study [p. 100]

J. Noguera, L. Baldez, N. Simon and L. Abello

This paper proposes a novel HW/SW co-simulation approach that minimizes the impact on software designers. We propose a SystemC-based system that enables the software team to test their software with their own tools and environment using an accurate simulated ASIC (Application Specific Integrated Circuit) model. The solution presented here enables a smooth and early ASIC and SW integration, which reduces the project development time and improves the ASIC design quality (i.e., SW engineers can help in the ASIC verification and ASIC engineers can help in the SW development). In this solution, the real and full software (i.e., multi-threaded application) runs in its native environment with minimal changes and interfaces with a simulated ASIC model using sockets. We have tested this approach on a pilot project, which has demonstrated the feasibility of this co-development methodology.

7D: Wireless Communication and Networking

Moderators: C. Grassmann, Infineon Technologies, DE; W. Mueller, C-LAB/Paderborn U, DE

Modeling and Simulation of Mobile Gateways Interacting with Wireless Sensor Networks [p. 106]

F. Fummi, D. Quaglia, F. Ricciato and M. Turolla

Sensor networks are emerging wireless technologies; their integration with the existing 2.5G, 3G mobile networks is a key issue to provide advanced services, e.g., health control. However this integration poses new challenges in the design and simulation of the involved embedded systems since it requires the cooperation of simulation tools that model hardware, software, and network aspects and their interactions. We present the modeling and simulation of a network scenario, core of a telecom provider's future portfolio, in which an ARM-based mobile handset is used as the gateway between a wireless sensor network (WSN) and remote users through a wide area network (WAN). Initially, the gateway and the WSN are modeled at system level with SystemC while the wide area network is modeled with NS-2. Then, HW/SW partitioning is performed on the gateway and an instruction set simulator of the ARM processor is used for the cycle-accurate execution of the RTOS and the application software.

A Hardware-Engine for Layer-2 Classification in Low-Storage, Ultra-High Bandwidth Environments [p. 112]

V. Papaefstathiou and I. Papaefstathiou

Ethernet is the most common Layer-2 network protocol, and it is currently being deployed beyond the tight borders of LANs. In order to accommodate the needs of MANs and WANs, several QoS mechanisms employed at the MAC sublayer of Ethernet have been proposed. These QoS mechanisms require identification of network flows and the classification of Ethernet packets according to certain Ethernet header fields. In this paper, we propose a classification engine employed at the MAC sublayer which uses an innovative hashing scheme and internal replacement of MAC Vendor IDs; the Hash Based Classification Engine (HBCE) compacts the tables containing the rules associated with certain MAC addresses and supports extremely high speed decisions, at a rate of more than 100Gb/sec, while its memory needs are significantly lower compared to those of the similar schemes currently used. This engine has been implemented in hardware utilizing less than 0.1mm² in a state-of-the-art CMOS technology. As a result, HBCE is a very promising candidate for next-generation Ethernet equipment that needs to support classification at the Data Link Layer at multi-Gigabit per second network speeds, whereas due to its very low memory requirements and low implementation complexity, it can also be employed very efficiently in lower-bandwidth wireless environments that utilize MAC mechanisms.

ASIP Architecture for Multi-Standard Wireless Terminals [p. 118]

D. Lo Iacono, J. Zory, E. Messina, N. Piazzese, G. Saia and A. Bettinelli

This paper presents the Block Processing Engine (BPE), an Application Specific Instruction-Set Processor (ASIP) explicitly designed for the implementation of multi-standard wireless terminals. Thanks to a high level of parallelism and a consistent use of pipelining, the BPE architecture fully satisfies stringent real-time constraints imposed by emerging technologies. Its efficiency has been proven through the implementation, the physical synthesis for the CMOS 90nm STM technology and the FPGA prototyping on the ARM Versatile platform of a dual-standard Frequency Domain Equalizer (FDE) supporting the 3GPP HSDPA and the IEEE 802.11a standards.

Interconnection Framework for High-Throughput, Flexible LDPC Decoders [p. 124]

F. Quaglio, F. Vacca, C. Castellano, A. Tarable and G. Masera

This paper presents a possible interconnection structure suitable for use in a flexible LDPC decoder. The main feature of the proposed approach is the possibility of implementing parallel or semi-parallel decoders with a reduced communication complexity. To the best of our knowledge this is the first work detailing the implementation of a fully flexible LDPC decoder, able to support any type of code. To prove the effectiveness of this approach, a complete decoder has been implemented on an XC2V8000, achieving a decoding throughput of 529 Mbps on a (1920,640) code.

Low Cost LDPC Decoder for DVB-S2 [p. 130]

J. Dielissen, A. Hekstra and V. Berg

Because of its excellent bit-error-rate performance, the Low-Density Parity-Check (LDPC) algorithm is gaining increased attention in communication standards and literature. The new Digital Video Broadcast via Satellite standard (DVB-S2) is the first broadcast standard to include an LDPC code, and the first implementations are available. In our investigation of generic LDPC implementations we found that scalable sub-block parallelism enables efficient implementations for a wide range of applications. For the DVB-S2 case, using sub-block parallelism we obtain half the chip size of known solutions. For the required performance in the normative configurations for the broadcast service (90Mbps), the area is even

3dID: A Low-Power, Low-Cost Hand Motion Capture Device [p. 136]

M. Sama, V. Pacella, E. Farella, L. Benini and B. Riccó

This paper presents a novel input device design for capturing gestures. The system is based on commodity components and combines accelerometers, gyroscopes and bend sensors. It is a low-power, low-cost hand device, characterized by extreme wearability thanks to wireless communication support and small form-factor. It can be used as a stand-alone platform or combined with other wireless sensor nodes in a body area network. The system has been tested as input interface for moving a virtual three-dimensional hand in real-time.

8D: HOT TOPIC - Industrially Proving SPIRIT Consortium Standards for Design Chain Integration

Organiser/Moderator: C. K. Lennard, ARM Ltd, UK

Industrially Proving SPIRIT Consortium Standards for Design Chain Integration [p. 142]

V. Berman, S. Fazzari, M. Indovina, C. Ussery, M. Strik, J. Wilson, O. Florent, F. Rémond, P. Bricaud

There has traditionally been significant engineering overhead required for the integration of multi-vendor tool and IP design methodologies. Making design-chain integration efficient is the key objective of the SPIRIT Consortium. This Special Session paper provides an insight into how the specifications of the SPIRIT Consortium are being adopted in the industry today. We present 3 production design-flow stories which show improved efficiency gained through use of the SPIRIT Consortium specifications. These include an IP generator for hierarchical VLIW processor design, a full hardware / software SoC integration design flow managed through generators, and methodology support for a flow from electronic system level (ESL) design through to the 65 nm CMOS process.

9D: On Chip Communication Networks

Moderators: K. Goossens, Philips Research, NL; M. Coppola, STMicroelectronics, FR

Networks on Chips for High-End Consumer-Electronics TV System Architectures [p. 148]

F. Steenhof, H. Duque, B. Nilsson, K. Goossens and R. Peset Llopis

Consumer electronics products, such as high-end (digital) TVs, contain complex systems on chip (SOC) that offer high computational performance at low cost. Traditionally, these SOCs are application-specific standard products (ASSPs) with limited programmability. We describe why TV SOCs must become more flexible, and why companion chips together with networks on chips (NOC) are a crucial enabling technology. In particular, networks that span multiple chips will become important in the near future. We demonstrate our ideas by extending a commercially available SOC for picture improvement in high-end TVs with the Æthereal NOC. Our first unoptimised results indicate that replacing the original interconnect (consisting of dedicated links and multiplexers for bypasses) by a programmable NOC increases the SOC area by 4% and its power dissipation by 12%. The new, flexible SOC allows new tasks to be spliced in at any point in the task graph. Both analytical performance verification and system simulations at RTL VHDL show that the extended SOC meets its functional requirements. Using the Æthereal design flow, the extended architecture was designed, implemented, and verified in 12 person months. To the best of our knowledge, this is the first application of a NOC to a commercial SOC. The quantitative results indicate that even retrofitting a NOC to an existing architecture is beneficial at acceptable cost.

Simulation and Analysis of Network on Chip Architectures: Ring, Spidergon and 2D Mesh [p. 154]

L. Bononi and N. Concer

NoC architectures can be adopted to support general communications among multiple IPs over multi-processor Systems on Chip (SoCs). In this work we illustrate the modeling and simulation-based analysis of some recent architectures for Network on Chip (NoC). Specifically, the Ring, Spidergon and 2D Mesh NoC topologies have been compared, both under uniform load and under more realistic load assumptions in the SoC domain. The main performance indexes considered are NoC throughput and latency, as a function of variable data-injection rates, source and destination distributions, and variable number of nodes. Results show that the Spidergon topology is a good trade-off between the performance and scalability of the most efficient architectures inherited from parallel computing systems design and the constraints of simple management and small energy and area requirements for SoCs.

GALS Networks on Chip: A New Solution for Asynchronous Delay-Insensitive Links [p. 160]

G. Campobello, M. Castano, C. Ciofi and D. Mangano

In this paper a cost-effective solution for asynchronous delay-insensitive on-chip communication is proposed. Our solution is based on the Berger coding scheme and achieves a very low wire overhead. For instance, the results of our evaluation show that a 64-bit link can be built paying a wire overhead of 10% and 30 equivalent two-input gates per wire. As a general rule, when the number of bits to be transmitted increases, the wire overhead decreases and the gate overhead remains almost the same.
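To make the wire-overhead trend in the abstract above concrete, here is a minimal sketch of a textbook Berger code, in which the check field carries the binary count of zero bits in the data word. This is the standard definition of the code, not necessarily the exact link encoding used by the authors; the function names are chosen for this example.

```python
def berger_check_width(data_bits: int) -> int:
    """Check bits needed to represent a zero count in the range 0..data_bits."""
    return data_bits.bit_length()


def berger_encode(data: int, data_bits: int) -> tuple[int, int]:
    """Return (data, check), where check is the number of zero bits in data."""
    masked = data & ((1 << data_bits) - 1)
    zeros = data_bits - bin(masked).count("1")
    return masked, zeros


def berger_valid(data: int, check: int, data_bits: int) -> bool:
    """A received word is valid if its check field matches its zero count."""
    return berger_encode(data, data_bits)[1] == check


def wire_overhead(data_bits: int) -> float:
    """Extra wires relative to the data width."""
    return berger_check_width(data_bits) / data_bits


# Overhead for a few link widths (illustrative values).
overheads = {n: wire_overhead(n) for n in (32, 64, 128)}
```

For a textbook Berger code, 64 data bits need 7 check bits (about 11% extra wires), which is in the same range as the 10% figure quoted in the abstract, and the ratio falls as the link gets wider, matching the stated general rule. Because the check field counts zeros, a data bit dropping from 1 to 0 raises the expected count while a check bit dropping from 1 to 0 lowers the stored one, so errors in a single direction are always detected.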

Flexible MPSoC Platform with Fast Interconnect Exploration for Optimal System Performance for a Specific Application [p. 166]

F. Dumitrascu, I. Bacivarov, L. Pieralisi, M. Bonaciu and A.A. Jerraya

One of the key elements in Multi-Processor Systems-on-Chip (MPSoC) design is to select the optimal on-chip interconnect architecture, in order to maximize the overall system performance. This paper proposes a flexible MPSoC platform, designed for a target application, which allows customizing the interconnect by selecting various architectures. It allows fast building of executable models from architecture specifications and performance evaluation using cycle-accurate cosimulation. We experimented with a DivX encoder application and three different interconnects: DMS (Distributed Memory Server), AMBA bus and Octagon Network-on-Chip (NoC). The simulation results for performance metrics such as average latency, throughput and execution time allowed us to compare these different interconnect architectures, to verify the application's real-time constraints and to propose further optimizations.

STAX: Statistical Crosstalk Target Set Compaction [p. 172]

S. Nazarian, M. Pedram, S.K. Gupta and M.A. Breuer

This paper presents STAX, a crosstalk target set compaction framework to reduce the complexity of the crosstalk ATPG process by pruning non-fault-producing targets. In general, existing pruning techniques do not employ their processes in a cost-effective manner, nor do they handle process variations properly. To address the first weakness, this paper presents a framework to determine a sequence of available analysis and pruning tool invocations that prunes as many crosstalk targets as possible, as fast as possible. As a result, an initially enormous collection of crosstalk targets is usually reduced to a very small set of targets via a vectorless process. A statistical static timing analyzer is developed and embedded to address the second shortcoming of existing approaches. Experimental results on the ISCAS'85 benchmarks demonstrate that STAX greatly improves the runtime compared to other crosstalk target pruning methodologies, including ATPG, with no prior target set compaction.
KEYWORDS ATPG, fault-producing target, compaction degree, pruning power, safe target, statistical static timing analyzer

A Fast-Lock Mixed-Mode DLL with Wide-Range Operation and Multiphase Outputs [p. 178]

K.-H. Cheung and Y.-L. Lo

10D: Automotive

Moderators: L. Fanucci, Pisa U, IT; J. Gerlach, Robert Bosch GmbH, DE

How OEMs and Suppliers Can Face the Network Integration Challenges [p. 183]

K. Richter and R. Ernst

Systems integration is a major challenge in many industries. Systematic analysis of the complex integration effects, especially with respect to timing and performance, significantly improves the design process, enables optimizations, increases the quality and profit of a product, and helps to improve supply-chain communications. This paper surveys a set of interesting experiments we have conducted on a real-world automotive communication network using our new SymTA/S system-level schedulability analysis technology. We demonstrate that analysis technology helps answer key integration questions, and show how it does so while carefully respecting the established business models.

A Practical Implementation of the Fault-Tolerant Daisy-Chain Clock Synchronization Algorithm on CAN [p. 189]

F. C. Carvalho, C. E. Pereira, E. T. Silva, Jr. and E. P. Freitas

Networked processing units are becoming widely used in the automotive embedded system domain, aiming not only to reduce vehicle weight and cost but also to help the driver cope with critical situations. Because these embedded networked systems are closely tied to human safety, there is a high demand for dependability, which can only be guaranteed if active redundancy is employed. Considering that the processing units are usually connected by a shared serial medium, the underlying communication platform is the most important building block. It must provide low-level support for deterministic data transmission as well as a global time base to coordinate the actions of replicated units. Within this context, this paper presents the development of the fault-tolerant Daisy-Chain clock synchronization algorithm over the CAN protocol, resulting in a highly optimized communication architecture for safety-critical applications. Implementation issues and some practical results are also discussed in the paper.

On the Verification of Automotive Protocols [p. 195]

G. Zarri, F. Colucci, F. Dupuis, R. Mariani, M. Pasquariello, G. Risaliti and C. Tibaldi

Verification quality is a must for functional safety in electronic systems. In automotive, the verification flow is historically based on a layered approach, where each level (model, design and system) has its own verification and validation methodology. Very often, these methodologies are poorly interconnected or not interconnected at all, and it is still common to see some of the most critical verification tasks confined to post-silicon validation, where the cost of solving issues can be prohibitive for deeply integrated electronic systems. This paper presents the architecture of verification components that can be applied at all the different levels and shows how they have been successfully applied to the verification of systems integrating LIN, CAN and FlexRay protocols.

FlexRay Transceiver in a 0.35 μm CMOS High-Voltage Technology [p. 201]

F. Baronti, P. D'Abramo, M. Knaipp, R. Minixhofer, R. Roncella, R. Saletti, M. Schrems, R. Serventi and V. Vescoli

This paper presents one of the first fully functional FlexRay transceivers manufactured in a 0.35 μm CMOS High-Voltage technology, which provides high-voltage MOS devices together with standard 3.3 V gates. The circuit operates as an interface between a generic controller and the copper-wire FlexRay physical bus, to be used in fault-tolerant and fail-safe applications. In particular, the transceiver meets the operating requirements of the automotive environment. The design was validated by means of simulations and experimental measurements on fabricated prototypes.

Space-Efficient FPGA-Accelerated Collision Detection for Virtual Reality [p. 206]

A. Raabe, S. Hochguertel, J. Anlauf and G. Zachmann

We present a space-efficient, FPGA-optimized architecture to detect collisions among virtual objects. The design consists of two main modules, one for traversing a hierarchical acceleration data structure, and one for intersecting triangles. This paper focuses on the former. The design is based on a novel algorithm for testing discretely oriented polytopes for overlap in 3D space. In addition, we derive a new overlap test algorithm that can be implemented using fixed-point arithmetic without producing false negatives and with bounded error. SystemC simulation results on different levels of abstraction show that real-time collision detection of complex objects at rates required by force-feedback and physically-based simulations can be obtained. In addition, synthesis results show that the design can still be fitted into a six-million-gate FPGA. Furthermore, we compare our FPGA-based design with a fully parallelized ASIC-targeted architecture and a software implementation.
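For readers unfamiliar with discretely oriented polytopes (DOPs), the sketch below shows the textbook interval test for two DOPs that share one fixed set of directions. It is not the paper's novel algorithm, which is not described in the abstract; the seven directions (a 14-DOP) and the function names are assumptions chosen for illustration. Integer coordinates stand in for fixed-point values, so every projection is computed exactly.

```python
# Seven fixed directions: three coordinate axes plus four diagonals (14-DOP).
DIRECTIONS = (
    (1, 0, 0), (0, 1, 0), (0, 0, 1),
    (1, 1, 1), (1, -1, 1), (1, 1, -1), (1, -1, -1),
)

def build_dop(vertices):
    """Return one (min, max) projection interval per direction."""
    slabs = []
    for dx, dy, dz in DIRECTIONS:
        proj = [dx * x + dy * y + dz * z for x, y, z in vertices]
        slabs.append((min(proj), max(proj)))
    return slabs

def dops_overlap(a, b):
    """Conservative test: False proves the objects are disjoint,
    True means they may intersect and need a finer test."""
    return all(lo_a <= hi_b and lo_b <= hi_a
               for (lo_a, hi_a), (lo_b, hi_b) in zip(a, b))

if __name__ == "__main__":
    tri_a = [(0, 0, 0), (4, 0, 0), (0, 4, 0)]
    tri_b = [(4, 4, 0), (4, 3, 0), (3, 4, 0)]
    # The axis-aligned intervals of these triangles overlap, but the
    # (1, 1, 1) direction separates them, so the DOP test rejects the pair.
    print(dops_overlap(build_dop(tri_a), build_dop(tri_b)))
```

Because a disjoint interval on any single direction is enough to reject a pair, the test never reports a false negative for the enclosed geometry, which is the property the abstract requires of its fixed-point variant.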

Mixed-Signal Design of a Digital Input Power Amplifier for Automotive Audio Applications [p. 212]

S. Saponara and P. Terreni

For digital-input power amplifiers in automotive audio applications, this paper presents an exhaustive exploration of the huge mixed-signal design space to find optimal trade-offs among different cost functions: distortion, efficiency, circuit complexity and sensitivity. Different architectural solutions are modelled and compared in a Simulink/Spice framework. All building blocks (i.e., oversampling filter, noise shaping, type of PWM modulation, type of feedback, power stage, LC filter) are optimized considering the whole system performance. A novel mixed-signal scheme is finally derived and prototyped.

Interactive Presentations

Automatic SystemC Design Configuration for a Faster Evaluation of Different Partitioning Alternatives [p. 217]

N. Bannow, K. Haug and W. Rosenstiel

In this paper we present a SystemC-based [1] rapid prototyping methodology that greatly enhances and accelerates the exploration of complex systems in order to optimize the system architecture. The approach automatically configures system components with regard to the memory mapping of modules. It reduces the implementation effort that conventional approaches require to re-assign and re-configure modules in a system by hand. This not only saves the time needed for manual adaptation but also reduces the chance of introducing errors of the kind known from complex manual modifications. The new approach for automatic system configuration is one of the results and features of the Module-Adapter (MA) based approach that we have proposed in different presentations [2], [3], [4]. One of the main goals of our proposed methodology is to fulfill industrial requirements such as applicability to complex system development, integration of existing IP, improved code quality and decreased development effort. The automated system configuration, as well as the whole MA-based approach, greatly supports designers in the concept phase in simulating a design before the implementation starts.

Multi-Sensor Configurable Platform for Automotive Applications [p. 219]

L. Serafini, F. Carrai, T. Ramacciotti and V. Zolesi

This paper presents a configurable and generic platform architecture suitable for interfacing several kinds of sensors for automotive applications. A platform-based design approach is pursued to reduce time-to-market. The platform is essentially a library of reconfigurable hardware and software resources. It is based on a microprocessor core plus a set of analog and digital peripherals dedicated to signal acquisition, data processing, storage and transmission. A particular instance of this platform has been developed. The prototype electronic board produced is able to acquire temperature, humidity and pressure, and to perform voltage/current measurements and settings. The results achieved prove the validity of the proposed approach in terms of system performance and high reconfigurability of the generic platform.

11D: Media and Signal Processing

Moderators: M. Heijligers, Philips Research, NL; L. Benini, DEIS - Bologna U, IT

Design and Implementation of a Modular and Portable IEEE 754 Compliant Floating-Point Unit [p. 221]

K. Karuri, R. Leupers, G. Ascheid, H. Meyr and M. Kedia

Multimedia and communication algorithms from the embedded systems domain often make extensive use of floating-point arithmetic. Due to the complexity and expense of floating-point hardware, final implementations of these algorithms are usually carried out using floating-point emulation in software, or conversion (manual or automatic) of the floating-point operations to fixed-point operations. Such strategies often lead to semi-optimal and imprecise software implementations. This paper presents the design and implementation of a Floating-Point Unit (FPU) for an Application Specific Instruction set Processor (ASIP) suitable for the embedded systems domain. Using a state-of-the-art Architecture Description Language (ADL) based ASIP design framework, the FPU is implemented in such a modular way that it can be easily adapted to any other RISC-like processor. The implemented operations are fully compliant with the IEEE 754 standard, which facilitates portable software development. The benchmarking of the designed FPU, in terms of energy, area and speed, highlights the trade-offs of having a hardware FPU w.r.t. software emulation of floating-point operations.
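As background on what IEEE 754 compliance fixes at the bit level, the sketch below unpacks a single-precision value into its sign, 8-bit biased exponent and 23-bit fraction fields, then rebuilds the value from them. It illustrates the standard's binary32 format only and says nothing about the internals of the FPU described in the abstract; the function names are hypothetical.

```python
import math
import struct

def decode_float32(x):
    """Split x, rounded to IEEE 754 binary32, into (sign, exponent, fraction)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF  # biased by 127
    fraction = bits & 0x7FFFFF      # 23 explicit fraction bits
    return sign, exponent, fraction

def float32_value(sign, exponent, fraction):
    """Rebuild the numeric value from the three binary32 fields."""
    s = -1.0 if sign else 1.0
    if exponent == 0xFF:            # all-ones exponent: infinity or NaN
        return math.nan if fraction else s * math.inf
    if exponent == 0:               # zero and subnormals: no implicit leading 1
        return s * math.ldexp(fraction, -149)
    return s * math.ldexp((1 << 23) | fraction, exponent - 127 - 23)

if __name__ == "__main__":
    for value in (1.0, -2.5, 0.15625):
        fields = decode_float32(value)
        print(value, fields, float32_value(*fields))
```

Software emulation has to perform this kind of field extraction, alignment and rounding in ordinary integer instructions for every operation, which is the cost a hardware FPU removes.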

A Novel FPGA-Based Implementation of Time Adaptive Clustering for Logical Story Unit Segmentation [p. 227]

S. Arifin and P. Y. K. Cheung

Time Adaptive Clustering (TAC) is a cognitive Logical Story Unit (LSU) segmentation algorithm that is found to show good and consistent results. This paper presents an efficient hardware implementation for approximating the TAC algorithm. The design consists of three main blocks. The first block generates similarity values needed in the clustering process. To take full advantage of the parallelism of Field Programmable Gate Array (FPGA) devices, a video shot sequence is divided into subsets and processed in parallel by the second block. The third block combines all the output results of each subset. The design is implemented on a Xilinx Virtex-II xc2v3000 on board an RC203E board and it runs 27 times faster than a Pentium 4-based PC at 3.4 GHz.

ASIP Design and Synthesis for Non Linear Filtering in Image Processing [p. 233]

L. Fanucci, M. Cassiano, S. Saponara, D. Kammler, E. M. Witte, O. Schleibusch, G. Ascheid, R. Leupers and H. Meyr

This paper presents an Application Specific Instruction Set Processor (ASIP) design for the implementation of a class of nonlinear image processing algorithms, the Retinex-like filters. Starting from high-level descriptions, algorithmic optimization is performed first. Then a processor architecture and an instruction set are customized with particular regard to the algorithmic computations in order to achieve the specified timing at reasonable complexity. Taking advantage of the programmability of processor architectures, the flexibility of the system is increased, allowing, e.g., dynamic parameter adjustment and color treatment. ASIP implementation results in 0.13 μm CMOS technology are presented.

A 124.8Msps, 15.6mW Field-Programmable Variable-Length Codec for Multimedia Applications [p. 239]

C. Yeh, C.-C. Wang, L.-C. Lee and J.-S. Wang

Variable-length coding is one of the key compression methods for multimedia bitstreams. To accommodate new or user-defined variable-length codes (VLC) for maximal compression in various applications, we propose a variable-length codec that supports field programmability along with very competitive performance indices. The design has 33% fewer transistors than its field-programmable predecessor. Moreover, measurement on the real chip demonstrates that the design is capable of processing 124.8 mega-symbols (Msym) per second for MPEG4, while consuming only 15.6mW at 1.4V. When measured by μW/Msym, the realized variable-length codec is even 5% better than the state-of-the-art non-programmable MPEG2 variable-length decoder that hardwires the entire design into random logic.

The Vector Fixed Point Unit of the Synergistic Processor Element of the Cell Architecture Processor [p. 244]

N. Maeding, J. Leenstra, J. Pille, R. Sautter, S. Buettner, S. Ehrenreich and W. Haller

A Vector Fixed Point Unit (FXU) is designed to speed up multimedia processing. The FXU implements SIMD-style integer arithmetic and permute operations. The adder, rotator and permute structure enables the use of static circuits only. The FXU was fabricated using IBM 90nm CMOS SOI technology.

Design and Test of Fixed-Point Multimedia Co-Processor for Mobile Applications [p. 249]

J.-H. Sohn, J. H.-Woo, J. Yoo and H.-J. Yoo

In this research, a fixed-point multimedia co-processor is designed, integrated into an ARM-10 based mobile graphics processor and tested for portable 2-D and 3-D multimedia applications. The fixed-point co-processor architecture with dual operations realizes advanced 3-D graphics algorithms and various streaming multimedia functions in a single piece of hardware while consuming low power. The instruction-wise clock gating on the fixed-point SIMD datapath allows fine-grained power control in an application-specific manner. The co-processor takes 10.2mm² in a 0.18μm 6-metal standard CMOS logic process and achieves 50Mvertices/s graphics performance with 75.4mW power consumption. The implemented chip is successfully demonstrated on a development board equipped with a software graphics library and evaluation environment.