#### **Advanced DSP Processors and Applications**

**Instructor(s):** Subra Ganesan, Oakland University



Professional workshops in a learning environment

Venue: Embedded System Workshop **October 13, 2012** 



#### Real Time DSP Systems and Applications

By: Dr. Subra Ganesan, Professor, ECE Department, Oakland University, Rochester, MI 48309, USA.

ganesan@oakland.edu

## ABSTRACT

# Real Time Digital Signal Processing Systems and Applications

- This presentation covers advances in hard and soft real time computer system design for uni-processor embedded DSP system applications and distributed real time DSP systems.
- Topics covered include characterizing real-time systems, performance measures, task assignment, scheduling, advances in DSP, VLIW DSP, DaVinci DSP, FPGA, SoC, parallel DSP systems, software development techniques, research issues, and practical applications in military and consumer products.

## Introduction

Advances in circuit technology, architecture, algorithms, and VLSI design techniques have contributed to high-performance Digital Signal Processing (DSP) microprocessors and to a multitude of novel applications of DSP chips.

DSP processors are RISC-based and have fast arithmetic units, on-chip memory, analog interfaces, serial ports, timers, counters, facilities for inter-processor communication, and other special features.

Current high-performance DSP processors have a VLIW architecture, **multiple heterogeneous/homogeneous cores**, high-speed interfaces to co-processors, etc.

## **Real Time Systems**

**Real-time embedded systems are used in many applications. Embedded systems need the following:** 

- Theoretical analysis
- Design using advanced design tools
- Use of the latest real-time software
- Performance monitoring
- Simulation and testing in a systematic way

## RT DSP System

- Any system where a timely response by the computer to an external input or condition is vital is a real-time system.
- Timely: meet the deadline.
- **Deadline:** Hard and soft
- **Complete task:** Accurately, or as an estimate, within the deadline.
- Hard real-time system examples: aircraft, nuclear reactor control.
- Soft real-time system examples: multimedia, Internet access.

#### **Automotive Embedded Applications**



#### A Car and Driver Example

- Driver is the real-time controller.
- Car is the controlled process.
- Road is the operating environment.
- Mission is to reach the destination.

**Performance Measure:** 

1) Time to reach the destination under various road conditions

2) Safety of the driver even if the destination is not reached

Task deadlines are not constant; they vary with the operating environment. Writing formal specifications and relating them is difficult.


# Issues in Real-time Computing

Research areas cover: computer architecture, fault-tolerant computers, networks, embedded systems, standards, digital communication, operating systems, clock synchronization.

Example:

**Task scheduling:** For a normal system, the goal is fairness to all tasks, as in round-robin scheduling.

For an RT system, the goal is to meet the deadlines of critical and high-priority tasks.

**Task execution time** should be predictable in an RT system. In a cache-based RT system, memory access time varies, which makes execution time harder to predict.
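The contrast between fairness-driven and deadline-driven scheduling can be sketched in a few lines of C. This is a minimal illustration, not a full scheduler: the `rt_task` structure and both function names are hypothetical, and earliest-deadline-first is used here only as one well-known deadline-driven policy (the slides do not name a specific one).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical task descriptor; field names are illustrative only. */
typedef struct {
    int      ready;     /* nonzero if the task can run now */
    uint32_t deadline;  /* absolute deadline in timer ticks */
} rt_task;

/* Round-robin: pick the next ready task after `last`, ignoring deadlines.
   Pass last = -1 on the first call. Returns -1 if no task is ready. */
int pick_round_robin(const rt_task *tasks, size_t n, int last)
{
    for (size_t k = 0; k < n; k++) {
        size_t idx = ((size_t)(last + 1) + k) % n;
        if (tasks[idx].ready)
            return (int)idx;
    }
    return -1;
}

/* Earliest-deadline-first: pick the ready task whose deadline is nearest.
   Returns -1 if no task is ready. */
int pick_earliest_deadline(const rt_task *tasks, size_t n)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (!tasks[i].ready)
            continue;
        if (best < 0 || tasks[i].deadline < tasks[best].deadline)
            best = (int)i;
    }
    return best;
}
```

With four tasks whose deadlines are 50, 5 (not ready), 20, and 90 ticks, round-robin after task 2 simply moves on to task 3, while the deadline-driven choice stays with task 2 because its deadline is nearest.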

#### Characteristics of Real-time Operating Systems

Real-time operating systems can be characterized as having unique requirements in five general areas:

- Determinism
- Responsiveness
- User control
- Reliability
- Fail-soft operation

#### DSP MICROPROCESSORS

DSP micros **are reduced-instruction-set computers** optimized for the fastest possible execution of the following instructions:

- Addition
- Subtraction
- Multiplication
- Shifting

Multiplication and shifting take a single cycle, using an array multiplier and a barrel (or combinational) shifter.

In contrast, general-purpose micros perform such operations via multiple-cycle, microcoded instructions that make use of the ALU's single-cycle, parallel-add, single-bit-shift capability.
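To make the multi-cycle cost concrete, the following C sketch multiplies two 16-bit values using only additions and single-bit shifts, roughly the way a microcoded sequence would. The function name and the step counter are illustrative assumptions; real microcode differs in detail.

```c
#include <stdint.h>

/* Multiply two unsigned 16-bit values one multiplier bit per step,
   using only add and single-bit shift. *steps (if non-NULL) receives
   the number of add/shift iterations performed, which is always 16. */
uint32_t shift_add_multiply(uint16_t a, uint16_t b, int *steps)
{
    uint32_t product = 0;
    uint32_t addend = a;
    int count = 0;

    for (int bit = 0; bit < 16; bit++) {
        if (b & 1u)
            product += addend;  /* add when the current multiplier bit is 1 */
        addend <<= 1;           /* single-bit shift of the multiplicand */
        b >>= 1;                /* move to the next multiplier bit */
        count++;
    }
    if (steps)
        *steps = count;
    return product;
}
```

Each call takes 16 add/shift steps regardless of the operands, whereas an array multiplier produces the same product in one cycle.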

#### **DSP micros employ**

- Pipelining of instructions
- Use of addressing modes that efficiently access relevant data structures (e.g., auto-increment and auto-decrement modes for arrays, and indexed addressing modes for FFTs)

#### Dual-bus Harvard architecture (separate data bus and instruction bus, with a gateway between them) enables

- Simultaneous fetching of data and instructions
- Special DSP-related addressing modes (e.g., index computation modulo an arbitrary number, automatic circular queue or free data move for FIR filters, bit reversal for FFTs)
- Extra addressing, multiple ALUs
- Special interfaces to serve specific fields of application (e.g., serial interfaces for CODECs in telecommunications)
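Two of the addressing modes listed above, circular-queue indexing and bit reversal, are shown below as plain C functions. This is only a software sketch of what the address generators compute; on a DSP the same index arithmetic is done in hardware alongside the data access. The function names are illustrative.

```c
#include <stddef.h>

/* Advance an index through a circular buffer of `len` entries:
   the index wraps to 0 after the last entry (modulo addressing). */
size_t circ_next(size_t idx, size_t len)
{
    return (idx + 1) % len;
}

/* Reverse the lowest `bits` bits of `idx`, as used to reorder
   the input or output of a radix-2 FFT of size 2^bits. */
unsigned bit_reverse(unsigned idx, unsigned bits)
{
    unsigned r = 0;
    for (unsigned i = 0; i < bits; i++) {
        r = (r << 1) | (idx & 1u);
        idx >>= 1;
    }
    return r;
}
```

For an 8-point FFT (3 index bits), index 1 (binary 001) maps to 4 (binary 100) and index 3 (011) maps to 6 (110).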

Progress in new technologies and multiple-core chips will increase the performance of future DSP microprocessors.

- TMS32010: 5 MIPS (1983)
- 333 MHz SIMD SHARC 32-bit processor: 2 GFLOPS (2009)
- TMS320C6472 fixed-point DSP (6 CPUs + megamodules): 8 instructions/cycle, 1 billion instructions/sec, and 33600 MMACS (million multiply-accumulate cycles per second)

#### FLOATING-POINT DIGITAL SIGNAL PROCESSING CHIPS

DSP has the capability to perform floating-point arithmetic including multiply-accumulate operations with an increased degree of parallelism.

Floating-point word format (figure): sign bit S (bit 0), exponent field E (bits 1-8), fraction field F (bits 9-31).

The new generation of floating-point DSPs includes the AT&T DSP32C, ADSP 21363, DSP96002, Texas Instruments TMS320C6472, OMAP Lx138, Atmel magic DSP, and others.

A typical development system could involve an iconic graphical interface (implemented in PC software) or a PC plug-in board containing a floating-point DSP microchip.



6713 DSK

#### DSP APPLICATIONS CHARACTERISTICS

1. Algorithms are mathematically intensive, e.g., for an FIR filter

$$y(n) = \sum_{i=0}^{N-1} a(i) \, x(n-i)$$

   where y(n) = output samples, a(i) = coefficients, x(n-i) = input samples, and N = number of coefficients.

2. Real-time performance, e.g., speech recognition, or image processing within a frame update period.

3. Sampled input signal: the DSP processor must effectively handle sampled data in large quantities.

4. DSP processors must be flexible to accommodate changing algorithms, new DSP processors, etc.
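The FIR sum from item 1, written as y(n) = Σ a(i)·x(n−i) over the filter coefficients, can be sketched directly in C. This is a minimal, unoptimized version; the function name is illustrative, and treating samples before x[0] as zero is an assumption made here to keep the example self-contained.

```c
#include <stddef.h>

/* Compute one FIR output sample:
   y(n) = sum over i = 0 .. ntaps-1 of a[i] * x[n - i].
   Input samples before x[0] are treated as zero (assumption). */
float fir_sample(const float *a, size_t ntaps, const float *x, size_t n)
{
    float y = 0.0f;
    for (size_t i = 0; i < ntaps && i <= n; i++)
        y += a[i] * x[n - i];   /* one multiply-accumulate per tap */
    return y;
}
```

Each output sample costs one multiply-accumulate per coefficient, which is why single-cycle MAC hardware and circular addressing matter for this kind of workload.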

## The DSP Environment: Definitions



#### **REPEAT INSTRUCTION**

A block of instructions is repeated 'count' number of times using RPTB. RC contains the count number.

| Label  | Instruction | Operands  |
|--------|-------------|-----------|
|        | LDI         | 8, RC     |
|        | RPTB        | Label1    |
|        | CALL        | filter    |
|        | FIX         | R0        |
| Label1 | STI         | R0, \*AR3 |

By contrast, the **RPTS** instruction repeats only the next instruction 'count' number of times.

### PARALLEL INSTRUCTION

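The symbol '||' indicates parallel operation.

|      | Instruction | Operands             | Comment         |
|------|-------------|----------------------|-----------------|
|      | LDF         | 0, R0                |                 |
|      | LDI         | 29, AR2              |                 |
|      | RPTS        | AR2                  |                 |
|      | MPYF        | \*AR0++, \*AR1++, R0 | New value of R0 |
| \|\| | ADDF        | R0, R2, R2           | Old value of R0 |

MPYF: multiply floating-point numbers. The ADDF marked '||' executes in parallel with it.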

### DELAYED BRANCH

A conditional or unconditional delayed branch allows the subsequent 3 instructions to be fetched and executed. This gives the effect of a single-cycle branch.

| Label | Instruction | Operands  | Comment                                     |
|-------|-------------|-----------|---------------------------------------------|
|       | BD          | Loop      | Delayed branch                              |
|       | ADDF        | R0, R1    | Executed whether the branch is taken or not |
|       | FIX         | R1        | Executed whether the branch is taken or not |
|       | STI         | R1, \*AR3 | Executed whether the branch is taken or not |
| Loop  |             |           |                                             |

Standard branches empty the pipeline before branching, so a standard branch takes 4 cycles to execute.

#### DSP CHIPS

- Analog Devices ADSP 21000, 21020, Blackfin, SHARC
- Freescale MSC 8156 DSP with StarCore SC3850
- NEC uPD 77C25 (16-bit fixed point)
- SGS Thomson ST 18 (16-bit fixed point)
- Texas Instruments TMS3201x, 2x, 3x, 4x, 5x, 80, 6xx, OMAP L138, TMS320DM365 (with ARM926EJ-S)
- Xilinx DSP FPGA
- Atmel magic DSP, 1 GFLOPS

#### MARKET SHARE

- TI 65%
- Freescale 15%
- AD 3%
- NEC 4%
- OTHERS 16%
- Source: www.edn.com/dspdirectory

## System Considerations





## C6000 Roadmap



## Fastest MAC using Natural C



float mac(float \*m, float \*n, int count)
{ int i; float sum = 0;

for (i=0; i < count; i++) {
 sum += m[i] \* n[i]; } ...

| Label / parallel / condition | Instruction         | Unit | Operands   |
|------------------------------|---------------------|------|------------|
| LOOP:                        | ; PIPED LOOP KERNEL |      |            |
|                              | LDDW                | .D1  | A4++,A7:A6 |
|                              | LDDW                | .D2  | B4++,B7:B6 |
| \|\|                         | MPYSP               | .M1X | A6,B6,A5   |
| \|\|                         | MPYSP               | .M2X | A7,B7,B5   |
|                              | ADDSP               | .L1  | A5,A8,A8   |
|                              | ADDSP               | .L2  | B5,B8,B8   |
| [A1]                         | B                   | .S2  | LOOP       |
| [A1]                         | SUB                 | .S1  | A1,1,A1    |
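The piped loop kernel above loads two element pairs per iteration and keeps two running sums (in A8 and B8). A rough C equivalent of that structure, written by hand rather than taken from the compiler, uses two independent partial sums that are combined at the end. The name `mac2` and the handling of an odd `count` are assumptions added for this sketch; note that splitting the sum can change the floating-point result in the last bits compared with a strictly sequential loop.

```c
/* Dot product with two independent partial sums, mirroring the two
   MPYSP/ADDSP pairs in the kernel. Hand-written sketch, not compiler output. */
float mac2(const float *m, const float *n, int count)
{
    float sum0 = 0.0f;
    float sum1 = 0.0f;
    int i;

    for (i = 0; i + 1 < count; i += 2) {
        sum0 += m[i]     * n[i];       /* even-indexed elements */
        sum1 += m[i + 1] * n[i + 1];   /* odd-indexed elements */
    }
    if (i < count)                     /* odd count: one element left over */
        sum0 += m[i] * n[i];

    return sum0 + sum1;
}
```

Because the two accumulations do not depend on each other, they can be issued to separate functional units in the same cycle, which is what the VLIW kernel exploits.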

## WHY VLIW ?

- Ability to exploit fine-grain, instruction level parallelism by:
  - Pipelining
  - Multiple processors
  - Superscalar implementation
  - Specifying multiple independent operations per instruction.
  - Simpler way to build a superscalar microprocessor.

# Implementation advantages of VLIW

- No need for decoding and dispatching hardware that tries to reconstruct parallelism from a serial instruction stream.
- The compiler has knowledge of the source code of the program
- With sufficient registers, it is possible to mimic the functions of the superscalar implementation's reorder buffer

# Advantage of Compiler complexity over hardware

- The complexity is moved from the hardware to the software. This complexity is paid for only once, when the compiler is written instead of every time a chip is fabricated.
- Chip may cost less to design, be quicker to design, and may require less debugging.

## **Code Composer Studio**

- The Code Composer Studio (CCS) application provides an integrated environment with the following capabilities:
  - Integrated development environment with an editor, debugger, project manager, profiler, etc.
  - 'C/C++' compiler, assembly optimiser and linker (code generation tools).
  - Simulator.
  - Real-time operating system (DSP/BIOS<sup>™</sup>).
  - Real-Time Data Exchange (RTDX<sup>™</sup>) between the Host and Target.
  - Real-time analysis and data visualisation.

#### MATLAB as the platform for Signal Processing & Technical Computing

- Analysis and Modeling
- Visualization
- Algorithm Development
- Prototyping & Simulation
- Application Deployment
- Verification & Validation






#### The MathWorks

#### MATLAB for algorithm development; Simulink for system and product development



#### **MATLAB Tools for Signal Processing**

- Analysis of signals and design of filters
  - Signal Processing toolbox
  - Filter Design toolbox
- Fixed-Point representation of signals
  - Fixed-Point toolbox
- Related products
  - Wavelet, Statistics, Image Processing toolboxes
- System-level design
  - Simulink and Signal Processing Blockset
- Path to HDL implementation
  - Filter Design HDL Coder
- Hardware and software verification
  - Link products (CCS and ModelSim)

## MATLAB Embedded IDE Link 4.0 and Target Support Package 4.0 (generates auto code)

Supports processors like:

- ARM
- PICCOLO: TI's low-power, low-cost micro with CCS IDE
- C674x: floating-point DSP with CCS IDE
- Blackfin BF 537 EZ kit: Analog Devices DSP
- C5510 DSK: TI DSP low-cost board

## **Hardware Verification & Validation**

- Link for Code Composer Studio
  - TI hardware
- Link for ModelSim
  - Simulate HDL generated using ModelSim



- About MATLAB and Simulink signal processing products
  - http://www.mathworks.com/products/product\_listing/index.html
- About relevant product demos
  - http://www.mathworks.com/products/demos/index.html
- User-contributed examples in MATLAB Central
  - http://www.mathworks.com/matlabcentral

1. 6437, which offers audio and video with a fixed-point processor
2. *OMAP-L137/TMS320C6747*, which offers audio only, with floating point

They are targeted from Real-Time Workshop. Videos and webinars are available at:

http://www.mathworks.com/company/events/webinars/wbnr38640.html

# Automatic HDL code generation from filter objects

- Functionality of Filter Design HDL Coder
- Supports both VHDL and Verilog code
- Command-line with generatehdl method
- GUI-based as a target in fdatool

Figure: Generate HDL dialog (Direct-Form FIR, order = 50). Options shown include filter target language (VHDL), name, target directory, reset type (asynchronous), reset asserted level (active-high), coefficient multipliers, FIR adder style, optimize for HDL, add pipeline registers, and test bench types (VHDL file, Verilog file, ModelSim .do file) with impulse, step, ramp, chirp, white noise, and user-defined responses.

# Analog Devices' DSPs

Analog Devices' DSPs are:

Blackfin, SHARC, SigmaDSP, TigerSHARC, and ADSP-21xx processors

Development tools for all of the company's processors include the VisualDSP++ integrated development and debugging environment, EZ-Kit Lite evaluation kits, EZ-Boards evaluation boards, and EZ-Extender daughtercards and emulators, as well as tools from SigmaStudio, and µClinux.

The Blackfin processor family has a 32-bit RISC-like instruction set with dual 16-bit MAC (multiply/accumulate) units.

The 32-bit floating/fixed-point **SHARC** processor family targets applications ranging from consumer, automotive, and professional audio to industrial, test-and-measurement, and medical equipment.

333 MHz **SIMD** SHARC core, capable of 2 GFLOPS peak performance. 3<sup>rd</sup> generation SHARC.



# FreeScale DSP

## 8/16-Bit Product Roadmap

- 16-bit: 56800/E, 16-bit hybrid controller with DSP capability
- 16-bit: growing 9S12 family, 16-bit performance for control and connectivity
- 8-bit: leading HC(S)08 family, integration and portfolio
- 8-bit: new HC08 family (Nitron), very low cost control
- Low to high-end CodeWarrior development environment

## 56800/E Digital Signal Controller Roadmap



## 56F8300/8100 Target Markets



#### General

- Bill validators
- Medical instrumentation
- Intelligent toys
- Metering
- Retail scanners
- Exercise equipment
- Security and safety systems
- Vending machines
- Home automation
- Performance migration for 56F800 customers

#### Automotive

- EPAS (Electronic Power Assisted Steering)
- Braking
- Transmission
- Active suspension
- Valve actuators
- Engine performance modules

#### Industrial

- UPS (Uninterruptible Power Supply)
- Power supplies
- Frequency inverters
- Protection relay
- Sensorless control
- Valve actuators
- Compressors









# Xilinx DSP

The Xilinx Virtex®-6 FPGA DSP Kit brings development tools, methodologies, IP, and support together into solutions that accelerate development for experienced users and simplify the adoption of FPGAs for new users.

- Xilinx ML605 development board including Virtex-6 LX240T FPGA
- ISE® Design Suite 11.4 System Edition (device-locked to Virtex-6 LX240T FPGA)
- ISE design tools, EDK, and System Generator for DSP™
- Simulink-based Digital Up Converter (DUC)/Digital Down Converter (DDC)

## The XtremeDSP<sup>™</sup> Development Kit – Virtex®-5 DSP

It is a comprehensive development kit that includes hardware, design tools, IP, and pre-verified reference designs that can rapidly accelerate the development of your next DSP application.

Along with the versatile ML506 platform, included in this kit is a full license of XtremeDSP Development Tools Package which includes System Generator for DSP and AccelDSP synthesis tool.

This helps users of MATLAB® and Simulink® (The Mathworks, Inc.) to create high-performance systems using Xilinx FPGAs.

## **Domain Optimized Platforms**

#### One Family – Multiple Platforms



## Domain-Specific Design: System Generator for DSP

- Library-based, visual data flow
- Polymorphic operators
- Arbitrary-precision fixed-point
- Bit- and cycle-true modeling
- Seamlessly integrated with MATLAB/Simulink
  - Including test bench and data analysis
- Automatic code generation
  - Synthesizable VHDL
  - IP cores
  - HDL test bench
  - Project and constraint files



#### At Xilinx, we do SoC design so you don't have to.


## **Integrated DSP Slice**

## 250 MHz implementation

- Fast multiplier & 48 bit adder
- ASIC-like performance
- Input and output registers for higher speed
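The multiplier-plus-48-bit-adder structure listed above can be modeled in C as a single multiply-accumulate step. This is a behavioral sketch only: the 18x18-bit signed operand width follows the "18x18 multiplier" mentioned for the DSP48A1 elsewhere in this deck, while the wrap-around behavior on overflow, the function names, and the omission of the slice's pre-adder and pipeline registers are assumptions made for illustration.

```c
#include <stdint.h>

/* Sign-extend the low `bits` bits of v to a full 64-bit value. */
static int64_t sign_extend(int64_t v, unsigned bits)
{
    uint64_t mask = (1ULL << bits) - 1;
    uint64_t u    = (uint64_t)v & mask;
    uint64_t sign = 1ULL << (bits - 1);
    return (int64_t)(u ^ sign) - (int64_t)sign;
}

/* One multiply-accumulate step: 18x18 signed multiply, then a 48-bit
   add that wraps around on overflow (assumed behavior for this model). */
int64_t dsp_mac48(int64_t acc, int32_t a, int32_t b)
{
    int64_t a18 = sign_extend(a, 18);   /* operands limited to 18 bits */
    int64_t b18 = sign_extend(b, 18);
    return sign_extend(acc + a18 * b18, 48);
}
```

An FIR tap then amounts to calling `dsp_mac48` once per coefficient, with the accumulator carried from one call to the next.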

#### XtremeDSP DSP48A1 Slice



#### **Optimizes FIR filter applications**



## Larger Number of DSP Hard Blocks

**Die Area Efficiency Improvement Example** 

#### DSP Cost Advantage: Effective Logic Cell Increase

Figure: equivalent logic cells (LCs) when DSP48A1 blocks are used. One DSP48A1 block is equivalent to an 18x18 multiplier and 300 logic cells\*. Devices shown: 6SLX9 (16 DSP48A1 blocks) and 6SLX150 (180 DSP48A1 blocks), plotted against density in logic cells (axis marks 10K, 15K, 150K, >200K).

\* 300 is typical; the actual range depends on the application (50 to 1300 logic cells saved). Assumes 300 LC savings and all DSP48A1 blocks used.

#### Abundant DSP48A1 Hard Blocks Enable Lower Cost by Increasing Effective Density by over 40%


## **Example: Mercedes S-Class**

#### **18 XILINX DEVICES IN EACH VEHICLE IF ALL OPTIONS ORDERED**





## **Embedded Commitment**




# SOC FPGA from Altera



Altera SoC FPGAs integrate a dual-core ARM<sup>®</sup> Cortex<sup>™</sup>-A9 MPCore<sup>™</sup> processor, memory controllers, and a rich set of peripherals with Cyclone<sup>®</sup> V and Arria<sup>®</sup> V-class FPGAs tightly coupled via a high-bandwidth interconnect backbone. This user-customizable ARM-based system-on-a-chip combines the performance and power savings of hard IP, with the flexibility of programmable logic, and robust software ecosystem of the ARM architecture.

# Five Reasons to Design with an SoC FPGA

- **Reduce board size:** Integrating the FPGA, microprocessor, and DSP functions in a single chip lets you reduce the number of devices on your board, minimizing board size and complexity.
- **Lower power consumption:** Take advantage of SoC FPGAs that leverage the Altera-optimized low-power 28-nm (28LP) process technology, a rich set of hard IP, and integrated low-power serial transceivers.
- **Reduce total system cost:** Reduce your bill of materials costs with fewer discrete devices, power supply rails, and oscillators required.
- **Design with FPGA flexibility:** Choose from a broad range of soft IP cores from Altera and third-party IP partners to quickly create a custom ARM processor system. Adapt to changing industry standards and market requirements with the flexible FPGA fabric. Quickly create custom hardware designs with the Quartus® II design software and Qsys system integration tool.
- **Common development tools:** Leverage the extensive ARM ecosystem of software development tools, operating systems, and middleware.



Complete "high-end" collision-avoidance system

## DSP in Automotive Embedded System

- DSP TMS 320 F28x for Electric Power Steering
- DSP for misfire detection in real time.



Setupassin an Automobile



**Control Block** 

## Watermarking----Spatial Domain Technique:

- The spatial-domain techniques directly modify the intensities or color values of some selected pixels
- Watermark Embedding
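One simple way to modify selected pixel values directly is least-significant-bit (LSB) substitution, sketched below in C. The slides do not say which spatial-domain method was used for the results that follow, so this is only an illustrative assumption; the function names, the 8-bit grayscale pixel format, and the fixed `stride` used to select pixels are likewise hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Embed signature bits into the least significant bit of every
   `stride`-th host pixel. Pixels are 8-bit intensities (assumption).
   Stops early if the host image runs out of pixels. */
void embed_lsb(uint8_t *host, size_t host_len,
               const uint8_t *bits, size_t nbits, size_t stride)
{
    for (size_t k = 0; k < nbits; k++) {
        size_t p = k * stride;
        if (p >= host_len)
            break;
        /* Clear the LSB, then write the signature bit into it. */
        host[p] = (uint8_t)((host[p] & 0xFEu) | (bits[k] & 1u));
    }
}

/* Read back the k-th embedded bit from the watermarked image. */
uint8_t extract_lsb(const uint8_t *host, size_t k, size_t stride)
{
    return (uint8_t)(host[k * stride] & 1u);
}
```

Changing only the LSB alters each selected intensity by at most 1, so the watermark is visually negligible, though it is also easy to destroy by recompression or filtering.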



## Results:

## Input :



## Output:

Figure 3 Original Host Image of size 128x128

Figure 4 Original Signature Image of size 64x64

Figure 4 Watermarked image of size 128x128

## Watermarking Applications

- Copyright Protection (Proof of Ownership)
  - To prove ownership
- Copy Control (Fingerprinting)
  - To trace illegal copies: Each copy has its own serial number
  - License agreement
- Data Authentication
  - Check if content is modified
- Data Hiding
  - Providing private secret messages
- Broadcast Monitoring (Internet, TV, Radio...)
  - For commercial advertisement

### DSP Sonar System



DSP based Sonar devices for Robot Navigation.



Sonar: Sound Navigation and Ranging
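Ranging with sonar rests on time of flight: the echo travels to the obstacle and back, so the range is half the round-trip distance. A minimal C sketch follows; the function names are illustrative, and the caller supplies the speed of sound (about 343 m/s in air at room temperature is a common assumption for robot sonar).

```c
/* Range in meters from the round-trip echo time in seconds.
   The sound travels out and back, hence the division by 2. */
double sonar_range_m(double echo_time_s, double speed_m_per_s)
{
    return speed_m_per_s * echo_time_s / 2.0;
}

/* Same calculation when the DSP measures the echo delay as a number
   of samples at sampling rate fs_hz. */
double sonar_range_from_samples(unsigned long delay_samples, double fs_hz,
                                double speed_m_per_s)
{
    return sonar_range_m((double)delay_samples / fs_hz, speed_m_per_s);
}
```

For example, an echo delay of 10 ms at 343 m/s corresponds to an obstacle a little over 1.7 m away.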

## **Active Noise Cancellation**



# Automotive Infotainment




# **Baby Monitor**



# **FingerPrint Identification**





# Android Meets Beagle

Jay J. Williams August 12, 2009


# Agenda

- What is a BeagleBoard
- Shopping List
- Building Android
- Preparing SDCard
- Booting
- Summary





# **TI-OMAP3530 Processor**

#### Application Processor

- 600 MHz ARM Cortex<sup>™</sup> A8 Core
- ARMv7 Architecture
- 16KB I-Cache; 16KB D-Cache; 256KB L2
- NEON<sup>™</sup> SIMD Coprocessor
- DSP Core
  - TMS320C64x DSP
  - L1 32KB Program Cache + 80KB Data Cache
  - L2 64K Program/Data Cache + 32KB SRAM
  - Video Hardware Accelerators
- Graphics Core
  - PowerVR SGX Graphics Accelerator
  - Tile Based Architecture: 10 MPoly/Sec
- On Chip Memory: 64KB SRAM



# **BeagleBoard Hardware**

**OMAP3530** Processor

- 600MHz Cortex-A8
  - NEON+VFPv3
  - 16KB/16KB L1\$
  - 256KB L2\$
- 430MHz C64x+ DSP
  - 32K/32K L1\$
  - 48K L1D
  - 32K L2
- PowerVR SGX GPU
- 64K on-chip RAM

POP Memory

- 128MB LPDDR RAM
- 256MB NAND flash

Peripheral I/O

- Expansion Header
- LCD Header
- DVI-D video out
- SD/MMC+
- S-Video out
- USB 2.0 Host
- USB 2.0 HS OTG
- I<sup>2</sup>C, I<sup>2</sup>S, SPI, MMC/SD
- JTAG
- Stereo in/out
- Alternate power
- RS-232 serial

#### **USB** Powered

- 2W maximum consumption
  - OMAP is small % of that
- Many adapter options
  - Car, wall, battery, solar, …

# BeagleBoard-xM

- Digikey \$149.00
  - http://dkc1.digikey.com/us/mkt/beagleboard.html



# Conclusion



**DSPs have a huge number of applications.** 

Technology is progressing very fast and computing power is increasing.

Multiple Core CPUs and Parallel Programming are the immediate future of DSP based Embedded Systems.

