Digital Signal Processing Teaching Platform
DSP Teaching Lab — Graduate Level
Covering Hilbert Spaces, Distribution Theory, Z-Transform, MUSIC/ESPRIT, Wavelet Analysis, Wigner-Ville Distribution
Communications · Radar · Imaging · Biomedical — Four Major Engineering Application Areas
Learning Path
Not sure where to start? Choose a path based on your goal:
🎯 Core Required Path (Recommended for All)
📡 Communications Engineering Path
Core → 2B.4 Z-Transform → 5A Decimation/Interp → 4A FIR → 4B IIR → 9C OFDM
🔬 Biomedical Signal Path
Core → 5A Decimation/Interp → 3.2 Welch → 4.1 Hilbert → 5.4 CWT → 6.9 EEG/ECG
⚙️ Vibration/Mechanical Path
Core → 5A Decimation/Interp → 3.2 Welch → 4.1 Hilbert → 4.2 Envelope → 6.10 Vibration
📡 Radar/Array Path
Core → 5B Polyphase → 3.4 MUSIC → 6.7 Radar → 6.8 Array
Rigorous Mathematics
L² Spaces · Distribution Theory · Full Derivations
Advanced Theory
MUSIC · Wavelets · Wigner-Ville · EMD
Four Major Applications
Communications OFDM · Radar · Imaging · Biomedical
📊 Real-World Datasets
Don't limit yourself to synthetic sine-wave practice. Below are public datasets you can download to run a complete DSP pipeline on real signals.
🫀 Biomedical Signals
- PhysioNet (physionet.org): clinical ECG, EEG, EMG, PPG data
- MIT-BIH Arrhythmia DB: classic ECG arrhythmia dataset
- Sleep-EDF: EEG sleep staging
- Suitable for: m6-1 Hilbert (R-peak detection), m7-1 STFT (EEG bands), m9-9 EEG/ECG
🔊 Audio
- ESC-50: 50 classes of environmental sounds (5 sec each)
- UrbanSound8K: urban sound classification
- LibriSpeech: 1000 hours of speech corpus
- Suitable for: m3b-1 windowing, m7-1 STFT, m4-* filter design
⚙ Mechanical Vibration
- Case Western Bearing Data: classic bearing fault dataset
- NASA IMS Bearing Dataset: run-to-failure bearing data
- MFPT Bearing Fault: multiple rotation-speed conditions
- Suitable for: m6-2 envelope spectrum, m9-10 vibration analysis, Phase 1 BPFO computation
📡 Communications / Radar
- RadioML 2018: IQ samples across many modulation schemes
- FMCW Radar Dataset: autonomous-driving radar data
- GNU Radio Tutorials: SDR examples
- Suitable for: m9-6 OFDM, m9-7 radar, communications receivers
💡 Getting started: The easiest entry point is PhysioNet ECG — the data is clean, has clear features (R-peaks), and lets you practice the full pipeline from m6-1 Hilbert all the way to m9-9.
1.1 Hilbert Space & $L^2$ Theory
The mathematical framework of Fourier analysis — Why can sinusoids form a "basis"?
⚠ Math Prerequisites Recommended
This module (M1 Mathematical Foundations) requires the following background:
- Linear algebra: inner products, orthogonality, eigenvalues, vector spaces
- Real analysis basics: limits, continuity, Cauchy sequences, series convergence
- Complex arithmetic: Euler's formula, complex exponentials, complex conjugates
- Basic topology (1.1, 1.3): completeness, density, modes of convergence
If you're not familiar with these: feel free to skip M1 and start directly from M2A.1 Fourier Series. M1 is the rigorous foundation for "why Fourier analysis works," but even without it, you can still correctly use every tool. M1 is for advanced students who want to understand the mathematical essence.
Why does this matter? Because the answers to questions like "Why does the energy computed by FFT equal the time-domain energy?" and "Why can sinusoids serve as a basis?" are all hidden in Hilbert space theory. It is the foundation of the entire Fourier analysis edifice — you don't need to think about it every day, but understanding it will give you deeper confidence in every subsequent tool.
One-line summary: Fourier analysis works because sinusoids form an orthogonal basis in a space called $L^2$ — just like the $x, y, z$ axes in 3D space.
Learning Objectives
- Define inner product spaces, norms, and completeness; understand the axiomatic structure of Hilbert spaces
- Recognize $L^2[0,T]$ as the natural function space for Fourier analysis
- Prove that the complex exponentials $\{e^{jn\omega_0 t}\}$ form an orthonormal basis in $L^2$
- Derive Parseval's identity from the inner product, establishing a rigorous foundation for "energy conservation"
The Problem: Questions Behind FFT You Never Asked
Every engineer uses the FFT for spectrum analysis. But have you ever wondered:
- Why can sinusoids serve as a "basis"? Who decided that frequency components must be sinusoids rather than some other waveform?
- Parseval's theorem says time-domain energy = frequency-domain energy — where does that come from? Is it an approximation or exact?
- Is the FFT just an approximation? A superposition of infinitely many sinusoids, truncated to finite terms — is it still "correct"?
The answers lie in Hilbert space theory. Once you understand it, you'll see that the FFT is not merely an algorithm, but the numerical implementation of a profound mathematical theorem.
Historical context: In the late 19th to early 20th century, David Hilbert (1862–1943) discovered that "infinite-dimensional vector spaces" required a rigorous mathematical foundation while studying integral equations. Frigyes Riesz and Ernst Fischer proved the completeness of $L^2$ spaces in 1907 (the Riesz-Fischer theorem), providing the most elegant explanation for the convergence of Fourier series. John von Neumann later axiomatized Hilbert spaces, making them the common language of quantum mechanics and signal processing.
Principles: From 3D Vectors to Function Spaces
Intuition first: In three-dimensional space, any vector $\vec{v}$ can be decomposed into components along the $x, y, z$ directions:
$$\vec{v} = v_x\hat{x} + v_y\hat{y} + v_z\hat{z}$$

Each component is obtained by projection (inner product): $v_x = \vec{v}\cdot\hat{x}$. Fourier analysis does exactly the same thing, except the three-dimensional space is replaced by a "function space," and $\hat{x}, \hat{y}, \hat{z}$ are replaced by $e^{jn\omega_0 t}$.
| Concept | 3D Vector Space $\mathbb{R}^3$ | Function Space $L^2[0,T]$ |
|---|---|---|
| Elements | Vector $\vec{v}$ | Function (signal) $f(t)$ |
| Inner Product | $\vec{u}\cdot\vec{v} = \sum u_i v_i$ | $\langle f,g\rangle = \frac{1}{T}\int_0^T f\bar{g}\,dt$ |
| Magnitude | $|\vec{v}| = \sqrt{\vec{v}\cdot\vec{v}}$ | $\|f\| = \sqrt{\langle f,f\rangle}$ (RMS value) |
| Orthogonal Basis | $\hat{x}, \hat{y}, \hat{z}$ | $e^{jn\omega_0 t}$, $n \in \mathbb{Z}$ |
| Projection (Coordinates) | $v_x = \vec{v}\cdot\hat{x}$ | $c_n = \langle f, e^{jn\omega_0 t}\rangle$ (Fourier coefficients) |
| Energy Conservation | $|\vec{v}|^2 = v_x^2 + v_y^2 + v_z^2$ | $\|f\|^2 = \sum|c_n|^2$ (Parseval) |
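The last three rows of this analogy can be checked numerically: approximating the $L^2[0,T]$ inner product by a Riemann sum, projection onto $\phi_n$ recovers the coordinates of a signal built from known basis directions. A minimal sketch in NumPy (all variable names are our own):

```python
import numpy as np

T = 1.0                       # period
N = 4096                      # samples per period
t = np.arange(N) * T / N      # uniform grid on [0, T)
w0 = 2 * np.pi / T

def inner(f, g):
    """L^2[0,T] inner product with the 1/T normalization, Riemann approximation."""
    return np.sum(f * np.conj(g)) * (T / N) / T

phi = lambda n: np.exp(1j * n * w0 * t)       # basis "directions"

# Signal with known coordinates: f = 3*phi_2 + (1-2j)*phi_{-5}
f = 3 * phi(2) + (1 - 2j) * phi(-5)

c2  = inner(f, phi(2))    # projection recovers the coordinate 3
cm5 = inner(f, phi(-5))   # projection recovers 1 - 2j
c0  = inner(f, phi(0))    # f has no DC component, so this projection is 0
print(np.round(c2, 6), np.round(cm5, 6), np.round(c0, 6))
```

On a uniform grid the Riemann sums of $e^{jk\omega_0 t}$ are sums of roots of unity, so the recovered coordinates are exact to floating-point precision.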
Rigorous Definition
Let $V$ be a complex vector space. An inner product $\langle \cdot,\cdot\rangle : V \times V \to \mathbb{C}$ satisfies:
- Conjugate symmetry: $\langle f,g\rangle = \overline{\langle g,f\rangle}$
- Linearity in the first argument: $\langle \alpha f + \beta g, h\rangle = \alpha\langle f,h\rangle + \beta\langle g,h\rangle$
- Positive definiteness: $\langle f,f\rangle \geq 0$, with equality if and only if $f = 0$
The norm induced by the inner product: $\|f\| = \sqrt{\langle f,f\rangle}$.
If the inner product space is complete under this norm (every Cauchy sequence converges to an element within the space), it is called a Hilbert space.
$L^2[0,T]$: The Space of Square-Integrable Functions
The natural habitat of Fourier analysis is the $L^2$ space:
Inner product defined as: $\displaystyle\langle f, g \rangle = \frac{1}{T}\int_0^T f(t)\,\overline{g(t)}\,dt$
Physical meaning of $L^2$: signals with finite energy. $\|f\|^2 = \langle f,f\rangle = \frac{1}{T}\int_0^T |f(t)|^2\,dt$ is the average power.
Intuition: $L^2$ is like an infinite-dimensional "vector space." Each signal is a "vector" in this space, the inner product measures the "similarity" between two signals, and the norm measures the "magnitude" (energy) of a signal. Fourier analysis is simply performing orthogonal projection in this space.
Orthonormal Basis: $\{e^{jn\omega_0 t}\}_{n\in\mathbb{Z}}$
Let $\phi_n(t) = e^{jn\omega_0 t}$, $\omega_0 = 2\pi/T$. Key theorem: this set of functions forms an orthonormal basis of $L^2[0,T]$.
Show orthogonality proof
Compute $\langle \phi_n, \phi_m \rangle$:
$$\langle \phi_n, \phi_m \rangle = \frac{1}{T}\int_0^T e^{jn\omega_0 t}\,\overline{e^{jm\omega_0 t}}\,dt = \frac{1}{T}\int_0^T e^{j(n-m)\omega_0 t}\,dt$$

Case 1: $n = m$

$$\frac{1}{T}\int_0^T 1\,dt = 1$$

Case 2: $n \neq m$

$$\frac{1}{T}\left[\frac{e^{j(n-m)\omega_0 t}}{j(n-m)\omega_0}\right]_0^T = \frac{1}{T}\cdot\frac{e^{j(n-m)2\pi} - 1}{j(n-m)\omega_0} = 0$$

because $e^{j(n-m)2\pi} = 1$ (an integer number of full rotations).
Conclusion: $\langle \phi_n, \phi_m \rangle = \delta_{nm}$ (Kronecker delta), orthonormal. $\;\blacksquare$
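The same conclusion can be checked numerically: the Gram matrix $[\langle\phi_n,\phi_m\rangle]$ evaluated by a Riemann sum comes out as the identity matrix, i.e. the Kronecker delta $\delta_{nm}$. A small sketch (grid and index range are our choices):

```python
import numpy as np

T = 2 * np.pi                 # period; then omega_0 = 1
N = 1000
t = np.arange(N) * T / N      # uniform grid on [0, T)

def inner_phi(n, m):
    # <phi_n, phi_m> = (1/T) * integral of e^{j(n-m)t} over [0, T), Riemann sum
    return np.sum(np.exp(1j * (n - m) * t)) / N

# Gram matrix for n, m in {-3, ..., 3}: should be the 7x7 identity (delta_nm)
idx = range(-3, 4)
G = np.array([[inner_phi(n, m) for m in idx] for n in idx])
print(np.round(np.abs(G), 12))
```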
Therefore, any $f \in L^2[0,T]$ can be expanded as:

$$f(t) = \sum_{n=-\infty}^{\infty} c_n e^{jn\omega_0 t}, \qquad c_n = \langle f, \phi_n\rangle = \frac{1}{T}\int_0^T f(t)\,e^{-jn\omega_0 t}\,dt$$
$c_n$ is the orthogonal projection coefficient of $f$ onto the basis $\phi_n$ — a perfect analogy to the coordinates of a finite-dimensional vector.
Parseval's Identity and Bessel's Inequality
Parseval's Identity
$$\|f\|^2 = \frac{1}{T}\int_0^T |f(t)|^2\,dt = \sum_{n=-\infty}^{\infty} |c_n|^2$$

Show Parseval's identity derivation
Expand $\|f\|^2 = \langle f, f \rangle$:
$$\langle f, f \rangle = \left\langle \sum_n c_n \phi_n,\; \sum_m c_m \phi_m \right\rangle = \sum_n \sum_m c_n \overline{c_m} \langle \phi_n, \phi_m \rangle$$

Using orthogonality $\langle \phi_n, \phi_m \rangle = \delta_{nm}$:

$$= \sum_n \sum_m c_n \overline{c_m}\,\delta_{nm} = \sum_n c_n \overline{c_n} = \sum_n |c_n|^2 \quad\blacksquare$$

Physical meaning: Total energy (power) computed in the time domain = sum of energies of all frequency components. Energy is conserved under orthogonal decomposition.
Bessel's inequality: If only $N$ finite terms are taken, then $\sum_{|n|\leq N} |c_n|^2 \leq \|f\|^2$. Equality holds as $N\to\infty$ (provided $\{\phi_n\}$ forms a complete basis).
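Both results are easy to verify for a concrete signal. The sketch below (our assumptions: a unit square wave and Riemann-sum coefficients) shows that the time-domain power equals 1 while the partial sums $\sum_{|n|\le N}|c_n|^2$ increase monotonically toward it without ever exceeding it, exactly as Bessel's inequality predicts:

```python
import numpy as np

T = 1.0
M = 100000                          # fine grid; the square wave is discontinuous
t = (np.arange(M) + 0.5) * T / M    # midpoint grid avoids sampling the jumps
w0 = 2 * np.pi / T
f = np.sign(np.sin(w0 * t))         # unit square wave, average power 1

def c(n):
    # c_n = (1/T) * integral of f(t) e^{-j n w0 t} dt, Riemann sum
    return np.mean(f * np.exp(-1j * n * w0 * t))

power = np.mean(np.abs(f) ** 2)     # time-domain power
bessel = [sum(abs(c(n)) ** 2 for n in range(-N, N + 1)) for N in (1, 5, 25, 125)]
print(round(power, 6), [round(b, 4) for b in bessel])
```

The first partial sum already captures $8/\pi^2 \approx 81\%$ of the power; the rest trickles in slowly because the square wave's coefficients decay only as $1/n$.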
How to Use: An Engineer's Practical Perspective
You don't need to think about Hilbert spaces every day, but they explain the following engineering facts:
| Engineering Question | Hilbert Space Explanation |
|---|---|
| FFT output energy = time-domain energy? | Parseval's identity: orthogonal decomposition preserves energy |
| Why does FFT use sinusoids instead of square waves? | $\{e^{jn\omega_0 t}\}$ is an orthonormal basis in $L^2$ |
| Is truncation to $N$ terms the "best approximation"? | Orthogonal projection theorem: partial sums are the best $L^2$ approximation |
| Why does the minimum MSE filter use projection? | Orthogonal projection in Hilbert space = MSE minimization |
Interactive: Orthogonality Verification
Compute the real part of $\langle e^{jm\omega_0 t}, e^{jn\omega_0 t}\rangle$. When $m \neq n$, the integral is zero (orthogonal); when $m = n$, it equals 1.
Pitfalls: "Equality" in $L^2$ Is Not Pointwise Equality
In $L^2$, two functions $f$ and $g$ being "equal" means $\|f - g\| = 0$, i.e., $\int|f-g|^2\,dt = 0$. This allows them to differ on measure-zero sets. For example, $f(t) = 0$ and $g(t) = \begin{cases}1 & t=0 \\ 0 & \text{else}\end{cases}$ are "the same function" in $L^2$. This is why Fourier series can "fail to converge to the correct value" at discontinuities, yet still converge perfectly in the $L^2$ sense.
References: [1] Kreyszig, Introductory Functional Analysis with Applications, Ch.3. [2] Rudin, Real and Complex Analysis, Ch.4. [3] Oppenheim & Willsky, Signals and Systems, Ch.3.
✅ Quick Check
Q1: Why does the energy of the FFT spectrum equal the time-domain energy? Explain in one sentence.
Show answer
Because {e^{jnω₀t}} is an orthonormal basis of L², and Parseval's identity guarantees energy conservation under orthogonal decomposition.
Q2: If two signals have identical FFT spectra, must their time-domain waveforms be the same?
Show answer
In the L² sense, yes (equal almost everywhere). But they may differ at finitely many points (differences on measure-zero sets are ignored in L²).
1.2 Generalized Functions & Distribution Theory
Why does the Fourier transform of $\sin(\omega_0 t)$ "exist"?
Why does this matter? Because textbooks say the FT of cos is two deltas, but delta is not a function — what is it exactly? Distribution theory gives the "intuitive operations" of physicists and engineers a rigorous foundation, and is the mathematical bedrock for understanding sampling, impulse response, and related concepts.
Previously... In 1.1 we established the Hilbert space framework — sinusoids are orthogonal bases, Parseval guarantees energy conservation. But some signals (like sin(ωt)) have infinite energy and are not in L². What do we do?
One-line summary: The energy of $\sin(\omega t)$ is infinite, so the classical Fourier Transform cannot handle it. Distribution Theory solves this problem using the $\delta$ function.
Learning Objectives
- Understand why the classical FT fails for signals like $\sin$, constants, etc.
- Recognize the Dirac delta as a functional, not a function
- Rigorously derive $\mathcal{F}\{\delta\}=1$, $\mathcal{F}\{1\}=2\pi\delta(\omega)$, $\mathcal{F}\{\cos\omega_0 t\}$
- Confidently use the various properties of $\delta(t)$ in engineering
The Problem: The "Mysterious Delta" in Textbooks
Every signal processing textbook writes:

$$\mathcal{F}\{\cos(\omega_0 t)\} = \pi[\delta(\omega - \omega_0) + \delta(\omega + \omega_0)]$$
But wait — the $\delta$ function has a value of "infinity" at $t=0$ and zero everywhere else? No classical function looks like this. Why do textbooks write it so confidently? Can engineers use it safely? Could it lead to errors in some calculation?
Distribution theory's answer: yes, you can use it safely, because $\delta$ is not a function — it is a functional. Once you understand this distinction, all the "mysterious operations" have a rigorous foundation.
Historical context: Physicist Paul Dirac extensively used the $\delta$ function in his 1930s quantum mechanics research to represent the state of a particle at a specific position. Mathematicians were uneasy — Dirac's operations were illegitimate in classical analysis. From 1944 to 1950, French mathematician Laurent Schwartz developed the Theory of Distributions, placing Dirac's intuitive operations on a rigorous functional analysis foundation. Schwartz received the Fields Medal in 1950 for this work. He himself said: "All I did was translate into mathematical language what physicists already knew."
Principles: From the Limitations of Classical FT to Distributions
Step 1: Where Does the Classical FT Fail?
The CTFT requires $\int_{-\infty}^{\infty}|x(t)|\,dt < \infty$ (absolutely integrable) or at least $\int|x|^2\,dt < \infty$ ($L^2$). But the most fundamental signals in engineering violate this condition:
| Signal | $\int|x(t)|\,dt$ | $\int|x(t)|^2\,dt$ | Classical FT? |
|---|---|---|---|
| $x(t) = A$ (DC) | $\infty$ | $\infty$ | Does not exist |
| $x(t) = \cos(\omega_0 t)$ | $\infty$ | $\infty$ | Does not exist |
| $x(t) = u(t)$ (unit step) | $\infty$ | $\infty$ | Does not exist |
| $x(t) = e^{j\omega_0 t}$ | $\infty$ | $\infty$ | Does not exist |
These signals share a common trait: infinite duration, infinite energy. Yet they are the most common signals in engineering!
Step 2: The Key Idea — Delta Is a Functional, Not a Function
Intuition: Imagine a "probe" $\varphi(t)$ — it is very smooth and decays rapidly to zero at infinity. We don't directly ask what the value of $\delta(t)$ is at a certain point (that question is meaningless), but rather: "What is $\delta$'s response to this probe?"
Rigorous Definition of the Dirac Delta
$$\delta[\varphi] = \varphi(0), \quad \forall\, \varphi \in \mathcal{S}$$

"$\delta$ is a machine: you feed in any test function $\varphi$, and it outputs the value of $\varphi$ at zero."
$\delta$ is not a "function" — no classical function $f(t)$ can simultaneously satisfy $\int f(t)\varphi(t)\,dt = \varphi(0)$ and $f(t) = 0$ for $t \neq 0$. It is a pure functional — a mapping that "eats functions and outputs numbers."
Intuitive analogy: Think of distributions as "generalized functions." An ordinary function $f(t)$ can also be viewed as a functional: $f[\varphi] = \int f(t)\varphi(t)\,dt$. But distributions allow more "singular" objects to exist — $\delta$ is the most famous example.
Step 3: Schwartz Space (Intuition Is Enough)
The "probe" $\varphi$ above comes from Schwartz space $\mathcal{S}$ — infinitely differentiable and rapidly decreasing functions. You don't need to memorize the exact definition, just know:
- Functions in $\mathcal{S}$ are very "well-behaved": smooth, rapidly decaying, and remain well-behaved no matter how many times you differentiate
- The Gaussian $e^{-t^2}$ is a typical member of $\mathcal{S}$
- Tempered Distribution $T \in \mathcal{S}'$: a continuous linear functional on $\mathcal{S}$
Basic Properties of the Delta Function
The following properties can all be rigorously derived from the functional definition; use them directly in engineering:
Fourier Transform of Distributions: Three-Step Derivation
For $T \in \mathcal{S}'$, define its FT $\hat{T} \in \mathcal{S}'$ as:

$$\hat{T}[\varphi] = T[\hat{\varphi}], \quad \forall\,\varphi\in\mathcal{S}$$

(Transfer the FT work to smooth test functions — their FT always exists.)
Derive $\mathcal{F}\{\delta(t)\} = 1$
Let $\hat{\varphi}(\omega) = \int_{-\infty}^{\infty}\varphi(t)\,e^{-j\omega t}\,dt$ be the classical FT of $\varphi$.
By definition:
$$\hat{\delta}[\varphi] = \delta[\hat{\varphi}] = \hat{\varphi}(0) = \int_{-\infty}^{\infty}\varphi(t)\,e^{0}\,dt = \int_{-\infty}^{\infty}\varphi(t)\,dt$$

The constant function $1$ acting as a distribution gives exactly the same result: $1[\varphi] = \int 1 \cdot \varphi(t)\,dt$.
The two are identical, therefore $\hat{\delta} = 1$.
Physical meaning: An infinitely narrow impulse contains all frequencies, each with equal amplitude. $\;\blacksquare$
Derive $\mathcal{F}\{1\} = 2\pi\delta(\omega)$
Using duality. If $\mathcal{F}\{f(t)\} = F(\omega)$, then $\mathcal{F}\{F(t)\} = 2\pi f(-\omega)$.
Since $\mathcal{F}\{\delta(t)\} = 1$, duality gives:
$$\mathcal{F}\{1\} = 2\pi\delta(-\omega) = 2\pi\delta(\omega)$$

(The last step used the symmetry of $\delta$: $\delta(-\omega) = \delta(\omega)$.)
Physical meaning: An eternally constant DC signal contains only the "zero frequency" component. $\;\blacksquare$
Derive $\mathcal{F}\{\cos(\omega_0 t)\}$ — The Ultimate Goal
Step 1: From $\mathcal{F}\{1\} = 2\pi\delta(\omega)$ plus the frequency shift property $\mathcal{F}\{e^{j\omega_0 t}f(t)\} = F(\omega - \omega_0)$:
$$\mathcal{F}\{e^{j\omega_0 t}\} = 2\pi\delta(\omega - \omega_0)$$

Step 2: Using Euler's formula $\cos(\omega_0 t) = \frac{1}{2}(e^{j\omega_0 t} + e^{-j\omega_0 t})$:

$$\mathcal{F}\{\cos(\omega_0 t)\} = \frac{1}{2}\cdot 2\pi\delta(\omega - \omega_0) + \frac{1}{2}\cdot 2\pi\delta(\omega + \omega_0)$$

$$\boxed{\mathcal{F}\{\cos(\omega_0 t)\} = \pi[\delta(\omega - \omega_0) + \delta(\omega + \omega_0)]}$$

Physical meaning: The spectrum of a pure sinusoid consists of two "needles" located exactly at $\pm\omega_0$. This perfectly matches engineering intuition — a pure tone has only one frequency. $\;\blacksquare$
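A discrete analogue of the two "needles" is easy to see with an FFT: when the window contains an integer number of cosine periods, all the energy lands in exactly two bins at $\pm f_0$, each with weight $1/2$ (the discrete counterpart of the two $\pi\delta$ terms). A minimal sketch (parameters are our choices):

```python
import numpy as np

fs = 1000.0                   # sample rate (Hz)
N = 1000                      # exactly 1 s of data -> integer number of periods
t = np.arange(N) / fs
f0 = 50.0                     # cosine frequency (Hz)
x = np.cos(2 * np.pi * f0 * t)

X = np.fft.fft(x) / N                     # normalized DFT
top2 = np.argsort(np.abs(X))[-2:]         # the two largest bins
print(sorted(top2.tolist()))              # bins 50 and 950 (i.e. -50 Hz)
print(np.round(np.abs(X[top2]), 6))       # each carries weight 1/2
```

Every other bin is zero to machine precision; with a non-integer number of periods the "needles" would leak into neighboring bins, which is a windowing effect, not a failure of the theorem.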
How to Use: Engineer's Practical Guide
Distribution theory guarantees the legitimacy of the following operations — you can use them with confidence:
| Operation | Mathematics | Engineering Meaning |
|---|---|---|
| Sampling = multiply by impulse train | $x_s(t) = x(t)\sum_n\delta(t-nT_s)$ | Mathematical model of ADC |
| Impulse response definition | $h(t) = T\{\delta(t)\}$ | LTI system fully determined by $h(t)$ |
| Discrete spectrum | $\mathcal{F}\{\cos\omega_0 t\} = \pi[\delta(\omega-\omega_0)+\delta(\omega+\omega_0)]$ | "Spectral lines" on a spectrum analyzer |
| FT of a constant | $\mathcal{F}\{A\} = 2\pi A\,\delta(\omega)$ | DC offset = zero-frequency component |
Engineer's quick note: Seeing $\delta$ in the frequency domain means that frequency has energy concentrated in an "infinitely narrow but finite-area" manner. The area of $\delta(\omega - \omega_0)$ is 1 (integral equals 1), representing a pure frequency component.
Applications
- Communications systems: Spectrum analysis of carrier modulation $x(t)\cos(\omega_c t)$. The FT of $\cos$ is two deltas, and multiplication corresponds to frequency-domain convolution — this is the mathematical foundation of frequency shifting. In 5G NR OFDM with 15 kHz subcarrier spacing, each subcarrier is ideally a $\delta$ line (broadened to a sinc in practice by the finite symbol duration).
- Control systems: The impulse response $h(t)$ is defined via $\delta(t)$. PID controller tuning starts from $h(t)$.
- Digital signal processing: The sampling process is modeled as $x(t)\cdot\sum\delta(t-nT_s)$, directly leading to the sampling theorem. The CD's 44.1 kHz sampling rate was derived from this model.
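The impulse-train sampling model directly predicts aliasing: multiplication by $\sum\delta(t-nT_s)$ replicates the spectrum at multiples of $f_s$, so a tone above Nyquist reappears at $|f_0 - f_s|$. A quick numerical check (parameters are our choices):

```python
import numpy as np

fs = 1000.0                         # sampling rate; Nyquist = 500 Hz
N = 1000
t = np.arange(N) / fs
f0 = 700.0                          # deliberately above Nyquist
x = np.cos(2 * np.pi * f0 * t)

spec = np.abs(np.fft.rfft(x)) / N
freqs = np.fft.rfftfreq(N, d=1 / fs)
f_peak = freqs[np.argmax(spec)]
print(f_peak)                        # 300.0 Hz: the alias |f0 - fs|
```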
Pitfalls & Limitations
- Do not apply nonlinear operations to delta: $\delta(t)^2$, $\sqrt{\delta(t)}$ have no meaning. Distributions only support linear operations (addition, scalar multiplication, differentiation, convolution, FT).
- Do not treat the "value" of $\delta$ as a number: $\delta(0) = \infty$ is just a heuristic statement; strictly speaking, $\delta$ has no "value" at any point.
- Products of $\delta$ are sometimes undefined: $\delta(t)\cdot\delta(t)$ is undefined in distribution theory. Only "distribution $\times$ smooth function" is meaningful.
References: [1] Strichartz, A Guide to Distribution Theory and Fourier Transforms, CRC Press. [2] Folland, Real Analysis, Ch.8-9. [3] Schwartz, Theorie des Distributions, 1950.
✅ Quick Check
Q1: Why can't we use the classical FT for sin(ωt)?
Show answer
Because ∫|sin(ωt)|dt = ∞, the absolute integrability condition is not met. Distribution theory is needed, yielding π[δ(ω-ω₀)+δ(ω+ω₀)].
Q2: Is δ(t) a function?
Show answer
No. It is a functional (distribution), defined as δ[φ]=φ(0). No classical function can do this.
Interactive: δ Function as a Limit
δ(t) is not a function, but it can be viewed as the limit of "narrower and taller" Gaussians: $\delta_\sigma(t) = \frac{1}{\sigma\sqrt{2\pi}}e^{-t^2/(2\sigma^2)}$. Observe what happens as $\sigma \to 0$: the height approaches infinity, but the area is always exactly 1.
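This limit can be checked numerically: as $\sigma$ shrinks, the area stays exactly 1 while the "response" $\int\delta_\sigma(t)\varphi(t)\,dt$ converges to $\varphi(0)$, exactly the functional definition $\delta[\varphi]=\varphi(0)$. A sketch (the particular test function $\varphi$ is our arbitrary choice):

```python
import numpy as np

def delta_sigma(t, sigma):
    """Normalized Gaussian: unit area for every sigma."""
    return np.exp(-t**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

t = np.linspace(-5, 5, 200001)
dt = t[1] - t[0]
phi = np.exp(-t**2) * np.cos(3 * t)      # smooth, rapidly decaying probe; phi(0) = 1

areas, responses = [], []
for sigma in (1.0, 0.3, 0.1, 0.03):
    g = delta_sigma(t, sigma)
    areas.append(np.sum(g) * dt)             # stays ~1 for every sigma
    responses.append(np.sum(g * phi) * dt)   # tends to phi(0) = 1 as sigma -> 0
print([round(a, 6) for a in areas])
print([round(r, 4) for r in responses])
```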
1.3 Convergence Theory
What does it really mean for a Fourier series to "converge"?
Why does this matter? Because Fourier series "convergence" has four different meanings, and confusing them leads to engineering errors. Understanding convergence rate = understanding why square waves need large bandwidth and why smooth pulses save bandwidth. The Gibbs phenomenon directly explains DAC ringing and intersymbol interference in digital communications.
Previously... Section 1.2 used distribution theory to solve the 'FT of infinite-energy signals' problem. But another question arises: do the partial sums of a Fourier series truly converge to the original function? In what sense?
One-line summary: Adding more terms to a Fourier series brings it closer to the original function — but "closer" has four different meanings, and in engineering you usually only need to care about energy-sense convergence ($L^2$ convergence).
Learning Objectives
- Distinguish between pointwise, uniform, $L^2$, and almost-everywhere convergence modes
- State the Dirichlet conditions and Carleson's theorem
- Quantify the relationship between convergence rate and signal smoothness
- Rigorously analyze the 8.95% overshoot of the Gibbs phenomenon
The Problem: Gibbs Ringing of the Square Wave
You've certainly seen this phenomenon: when reconstructing a square wave with a Fourier series, there is pronounced ringing near the discontinuities. Even with hundreds of terms, the overshoot never disappears — it's approximately 9% of the square wave amplitude.
- DAC output: When a digital-to-analog converter reconstructs a square wave signal, real physical ringing occurs
- Image processing: "Mosquito noise" near sharp edges in JPEG is the Gibbs effect caused by frequency-domain truncation
- Filter design: The impulse response of an ideal lowpass filter = sinc function (infinite length); truncation causes passband ripple
Why doesn't the ringing disappear? What type of convergence issue is this? The answer lies in the details of convergence theory.
Historical context: The Gibbs phenomenon is named after J. Willard Gibbs (discovered in 1899), but English mathematician Henry Wilbraham had already described it as early as 1848. Dirichlet (1829) first gave sufficient conditions for pointwise convergence of Fourier series. The most profound result came from Lennart Carleson (1966): he proved that the Fourier series of $L^2$ functions converge "almost everywhere" — a result so difficult that Carleson received the 2006 Abel Prize for it.
Principles: Four Meanings of "Closeness"
Intuition first: Imagine you took a photo (original function) and approximate it with more and more pixels. "Good approximation" can have different criteria:
- Pointwise: Every pixel is correct
- Uniform: Even the pixel with the largest error tends to zero
- $L^2$: Total error energy tends to zero (individual pixel deviations allowed)
- Almost everywhere: Every pixel is correct except for finitely many bad ones
Let $S_N(t) = \sum_{|n|\leq N} c_n e^{jn\omega_0 t}$ be the partial sum. The meaning of convergence to $f(t)$ depends on the convergence mode:
| Mode | Definition | Condition | Engineering Meaning |
|---|---|---|---|
| Pointwise | $\forall t$: $S_N(t) \to f(t)$ | Dirichlet conditions | Converges at every time point |
| Uniform | $\sup_t |S_N(t)-f(t)|\to 0$ | $f$ continuous + absolutely convergent | Strongest; impossible for square waves (Gibbs) |
| $L^2$ (Mean-Square) | $\int|S_N-f|^2 dt \to 0$ | $f \in L^2$ (always holds) | Energy-sense convergence; most commonly used |
| Almost Everywhere (a.e.) | $S_N(t)\to f(t)$ except on measure-zero set | $f\in L^2$ (Carleson 1966) | Converges except at negligible points |
Key insight: The Fourier series of a square wave converges perfectly in the $L^2$ sense (total energy error → 0), but does not converge uniformly near discontinuities (Gibbs overshoot persists forever). This is not a contradiction — these are two different measurement criteria.
Convergence Rate and Smoothness
Key theorem: The smoother the signal, the faster the Fourier coefficients decay.
| Signal | Continuity | $c_n$ Decay | 10-Term Approx. Error | Convergence Speed |
|---|---|---|---|---|
| Square wave | Discontinuous | $O(1/n)$ | ~10% | Slow (Gibbs) |
| Triangle wave | $C^0$ (continuous but not differentiable) | $O(1/n^2)$ | ~1% | Moderate |
| Parabolic wave | $C^1$ | $O(1/n^3)$ | ~0.1% | Fast |
| $C^\infty$ function | Infinitely differentiable | Super-algebraic decay | $<10^{-10}$ | Extremely fast |
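The decay rates in the first two rows can be measured directly: for $O(1/n)$ decay the ratio $|c_n|/|c_{3n}|$ is about 3, while for $O(1/n^2)$ it is about 9. A sketch comparing square and triangle waves (grid size and harmonic indices are our choices):

```python
import numpy as np

N = 8192
t = (np.arange(N) + 0.5) * 2 * np.pi / N        # one period, midpoint grid
square = np.sign(np.sin(t))                     # discontinuous
triangle = (2 / np.pi) * np.arcsin(np.sin(t))   # continuous, corners only

def coeff_mag(x, n):
    # |c_n| via a Riemann-sum inner product
    return abs(np.mean(x * np.exp(-1j * n * t)))

# For O(1/n) decay, |c_5|/|c_15| ~ 3; for O(1/n^2) decay it is ~ 9
r_sq = coeff_mag(square, 5) / coeff_mag(square, 15)
r_tr = coeff_mag(triangle, 5) / coeff_mag(triangle, 15)
print(round(r_sq, 2), round(r_tr, 2))
```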
Engineering insight: This explains why: (1) Square waves in digital communications need very large bandwidth (slow harmonic decay), while smooth raised-cosine pulse shaping requires only finite bandwidth. (2) Adding a smoothing filter at the DAC output can dramatically reduce required bandwidth. (3) Sigma-delta modulator noise shaping exploits this principle.
Rigorous Analysis of the Gibbs Phenomenon
Near the discontinuity of a square wave, the maximum overshoot of $S_N$ approaches:

$$\lim_{N\to\infty}\max_t S_N(t) = \frac{2}{\pi}\int_0^{\pi}\frac{\sin u}{u}\,du = \frac{2}{\pi}\,\mathrm{Si}(\pi) \approx 1.1790$$
Derive the Gibbs overshoot ratio
The partial sum of the square wave $f(t) = \text{sgn}(\sin t)$ can be written as:
$$S_N(t) = \frac{4}{\pi}\sum_{k=0}^{N-1}\frac{\sin((2k+1)t)}{2k+1}$$

The first overshoot occurs near $t = \pi/(2N)$. Substituting $u_k = (2k+1)\pi/(2N)$, so that $2k+1 = 2Nu_k/\pi$, each term becomes $\frac{2}{\pi}\cdot\frac{\sin u_k}{u_k}\,\Delta u$ with step $\Delta u = \pi/N$:

$$S_N\!\left(\frac{\pi}{2N}\right) = \frac{2}{\pi}\sum_{k=0}^{N-1}\frac{\sin u_k}{u_k}\,\Delta u$$

As $N\to\infty$, this Riemann sum approaches:

$$\frac{2}{\pi}\int_0^{\pi}\frac{\sin u}{u}\,du = \frac{2}{\pi}\,\mathrm{Si}(\pi) \approx \frac{2}{\pi}(1.8519) \approx 1.1790$$

The ideal square wave value is 1, so the overshoot $\approx 17.90\%$ of the half-amplitude = $8.95\%$ of the full peak-to-peak amplitude.
Key point: This $8.95\%$ is independent of $N$ — adding more terms will never make it disappear. Increasing $N$ only moves the overshoot closer to the discontinuity, but the height remains unchanged. $\;\blacksquare$
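The $N$-independence of the overshoot is easy to confirm numerically: the maximum of the partial sum stays near $1.179$ whether we keep 10 or 1000 harmonics. A minimal sketch (names and grid are ours):

```python
import numpy as np

def square_partial_sum(t, N):
    """Sum of the first N odd harmonics of the unit square wave sgn(sin t)."""
    s = np.zeros_like(t)
    for k in range(N):
        n = 2 * k + 1
        s += np.sin(n * t) / n
    return (4 / np.pi) * s

# Scan just to the right of the jump at t = 0, where the first overshoot lives
t = np.linspace(1e-4, 1.0, 40000)
peaks = [square_partial_sum(t, N).max() for N in (10, 100, 1000)]
print([round(p, 4) for p in peaks])   # all close to 1.1790
```

Increasing $N$ squeezes the overshoot lobe toward the discontinuity, but its height does not shrink.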
Practical impact of the Gibbs phenomenon:
- DAC output: ~9% physical overshoot when reconstructing square waves, potentially triggering downstream circuit thresholds
- FIR filters: Truncating the ideal impulse response → passband ripple; this is why window functions (Hann, Kaiser, etc.) are needed
- Solutions: Use Lanczos sigma factors, Fejer summation (Cesàro averaging), or directly apply window functions for smooth truncation
How to Use: Convergence Rate Guides Engineering Design
Smoother signal → faster coefficient decay → less bandwidth needed. Practical applications of this principle:
| Scenario | Application | Specific Parameters |
|---|---|---|
| Pulse shaping | Raised-cosine roll-off factor $\alpha$ controls smoothness | $\alpha=0.25$: bandwidth = $1.25/T_s$ |
| DAC reconstruction | Higher-order interpolation reduces aliasing energy | Linear interp.: $-12$ dB/oct; cubic: $-24$ dB/oct |
| FIR design | Window functions are equivalent to smooth truncation | Kaiser $\beta=6$: sidelobes $-46$ dB |
Interactive: Convergence Rate Comparison
Compare the Fourier series convergence speeds of square wave ($1/n$), triangle wave ($1/n^2$), and parabolic wave ($1/n^3$). Drag the slider to observe the Gibbs phenomenon.
Pitfalls & Common Misconceptions
- "Adding more terms will eliminate Gibbs ringing" — Wrong. The overshoot percentage is fixed at 8.95%, regardless of $N$.
- "$L^2$ convergence is sufficient" — Usually yes, but if you care about signal peak values (e.g., power amplifier headroom design), $L^2$ convergence does not guarantee peak control.
- "Carleson's theorem says almost-everywhere convergence" — But at discontinuities, the series converges to the average of the left and right limits $\frac{f(t^+)+f(t^-)}{2}$, not either side's value.
References: [1] Carleson, On convergence and growth of partial sums of Fourier series, Acta Math., 1966. [2] Korner, Fourier Analysis, Cambridge. [3] Gibbs, Fourier's Series, Nature, 1899.
✅ Quick Check
Q1: Why does the Fourier series of a triangle wave converge faster than that of a square wave?
Show answer
The triangle wave is continuous (C⁰), so coefficients decay as 1/n²; the square wave is discontinuous, with coefficients decaying only as 1/n. Smoother → faster convergence.
Q2: Does the Gibbs overshoot disappear as N→∞?
Show answer
The 'width' of the overshoot approaches zero, but the 'height percentage' is always ~9% and never disappears.
1.4 Uncertainty Principle
Complete proof and engineering applications of the Heisenberg-Gabor inequality
Why does this matter? Because the uncertainty principle determines the limits of everything you can achieve — how to choose STFT window length, why radar range and velocity resolution cannot both be optimal simultaneously, and optimization of communications pulse shaping. It is not a "theoretical limitation" but an engineering constraint you face in your daily work.
Previously... Section 1.3 showed that Fourier series indeed converge (in the L² sense), with convergence rate depending on signal smoothness. But even with convergence, the 'concentration' in time and frequency domains has an unbreakable lower bound —
One-line summary: You cannot simultaneously know precisely "when a signal occurs" and "what frequencies it contains" — this is a mathematical theorem, not an instrument limitation.
Learning Objectives
- Define the rigorous mathematical meaning of time-domain spread $\Delta t$ and frequency-domain spread $\Delta\omega$
- Derive $\Delta t\cdot\Delta\omega \geq \frac{1}{2}$ completely from the Cauchy-Schwarz inequality
- Prove that the Gaussian achieves equality (minimum uncertainty)
- Connect to STFT resolution limits and the radar ambiguity function
The Problem: Why Is STFT Window Length So Hard to Choose?
Nearly every engineer doing time-frequency analysis has encountered this dilemma:
- STFT window length: Too short → poor frequency resolution (can't separate two close frequencies); too long → poor time resolution (can't separate two close events). No choice seems right.
- Radar waveform design: Range resolution requires short pulses (large bandwidth), velocity resolution requires long pulses (narrow bandwidth). Why can't both be optimal?
- Communications systems: Narrower OFDM subcarrier spacing → higher spectral efficiency, but longer symbol duration → more sensitive to time-varying channels.
The answer is not that your design isn't good enough — mathematics itself forbids simultaneous precision. This is the uncertainty principle.
Historical context: Dennis Gabor (1900–1979, 1971 Nobel Prize in Physics laureate, inventor of holography) introduced Werner Heisenberg's quantum mechanical uncertainty principle into signal theory in his seminal 1946 paper Theory of Communication. Gabor pointed out: if signals are viewed as "information quanta (logons)" on the time-frequency plane, each quantum occupies at least $\Delta t \cdot \Delta f \geq \frac{1}{4\pi}$ of area. This result has exactly the same mathematical structure as quantum mechanics' $\Delta x \cdot \Delta p \geq \hbar/2$ — the difference lies only in the physical interpretation.
Principles: Rigorous Statement
Intuition first: Imagine a guitar string being plucked. If you play only an extremely short note (click), you know the precise time, but "what note it is" is vague. If you sustain a steady note (drone), the frequency is clear, but "when it started" is ambiguous. You cannot make both infinitely precise simultaneously.
Define the "spread" of the time and frequency domains as root-mean-square widths (second moments):
$$\Delta t^2 = \frac{\int (t-\bar t)^2|f(t)|^2\,dt}{\int|f(t)|^2\,dt}, \qquad \Delta\omega^2 = \frac{\int (\omega-\bar\omega)^2|F(\omega)|^2\,d\omega}{\int|F(\omega)|^2\,d\omega}$$where $\bar t$ and $\bar\omega$ are the time and frequency centroids.
Heisenberg-Gabor Inequality
$$\boxed{\Delta t \cdot \Delta\omega \geq \frac{1}{2}}$$Equality holds if and only if $f(t) = Ce^{-\alpha t^2}$ (Gaussian)
Complete Proof
Show full derivation (using Cauchy-Schwarz)
Step 1: Without loss of generality, assume $\|f\| = 1$ (normalized), and the centroid of $f$ is at the origin (otherwise translate).
Step 2: Using the differentiation property $\mathcal{F}\{tf(t)\} = j\frac{d}{d\omega}F(\omega)$, and $\mathcal{F}\{f'(t)\} = j\omega F(\omega)$. By Parseval:
$$\Delta t^2 = \int t^2|f(t)|^2\,dt, \quad \Delta\omega^2 = \frac{1}{2\pi}\int \omega^2|F(\omega)|^2\,d\omega = \int|f'(t)|^2\,dt$$Step 3: Apply the Cauchy-Schwarz Inequality to $tf(t)$ and $f'(t)$:
$$\left|\int tf(t)\overline{f'(t)}\,dt\right|^2 \leq \int t^2|f|^2\,dt \cdot \int |f'|^2\,dt = \Delta t^2 \cdot \Delta\omega^2$$Step 4: Compute the left side. Note that $\text{Re}\left(f\overline{f'}\right) = \frac{1}{2}\frac{d}{dt}|f|^2$, so using integration by parts:
$$\text{Re}\int tf(t)\overline{f'(t)}\,dt = \int t \cdot \frac{1}{2}\frac{d}{dt}|f(t)|^2\,dt$$ $$= \left[\frac{t}{2}|f|^2\right]_{-\infty}^{\infty} - \frac{1}{2}\int|f(t)|^2\,dt = 0 - \frac{1}{2}\|f\|^2 = -\frac{1}{2}$$(The boundary terms vanish because $f \in L^2$ with finite $\Delta t$ forces $t|f(t)|^2 \to 0$ as $|t|\to\infty$.)
Step 5: Since $\left|\int tf\overline{f'}\,dt\right| \geq \left|\text{Re}\int tf\overline{f'}\,dt\right| = \frac{1}{2}$, substituting into Cauchy-Schwarz:
$$\frac{1}{4} \leq \Delta t^2 \cdot \Delta\omega^2$$Using $\Delta\omega$ (angular frequency): $\Delta t \cdot \Delta\omega \geq \frac{1}{2}$.
Using $\Delta f$ (Hz): $\Delta t \cdot \Delta f \geq \frac{1}{4\pi}$. $\;\blacksquare$
Equality condition: Cauchy-Schwarz equality $\iff$ $f'(t) = -2\alpha t f(t)$ $\iff$ $f(t) = Ce^{-\alpha t^2}$. The Gaussian is the only waveform that achieves minimum uncertainty.
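The theorem can also be checked numerically. A minimal sketch (NumPy; the grid limits and the two test pulses are illustrative choices), estimating $\Delta t$ from the second moment and $\Delta\omega$ from $\int|f'|^2/\int|f|^2$, the Parseval form used in the proof:

```python
import numpy as np

def uncertainty_product(f, t):
    """Delta_t * Delta_omega from second moments (assumes centroid at t = 0)."""
    dt = t[1] - t[0]
    p = np.abs(f)**2
    energy = p.sum() * dt                          # integral of |f|^2 dt
    var_t = (t**2 * p).sum() * dt / energy         # Delta_t^2
    fp = np.gradient(f, dt)
    var_w = (np.abs(fp)**2).sum() * dt / energy    # Delta_omega^2 (Parseval form)
    return np.sqrt(var_t * var_w)

t = np.linspace(-20, 20, 40001)
prod_gauss = uncertainty_product(np.exp(-t**2), t)       # Gaussian: attains 1/2
prod_lapl  = uncertainty_product(np.exp(-np.abs(t)), t)  # two-sided exp: above 1/2
print(prod_gauss, prod_lapl)   # ~0.5, ~0.707
```

The Gaussian sits at the bound; any other pulse shape lands strictly above it.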
How to Use: Three Major Engineering Constraints
1. STFT Window Length Selection
Window length $T_w$ determines the time-frequency resolution:
| Analysis Goal | Window Length Recommendation | Typical Values |
|---|---|---|
| Speech formant tracking | Short window (time resolution priority) | $T_w = 20{-}30$ ms |
| Music pitch detection | Long window (frequency resolution priority) | $T_w = 50{-}100$ ms |
| Vibration monitoring | Adjust according to frequency range | $\Delta f < 1$ Hz → $T_w > 1$ s |
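The tradeoff in the table can be reproduced in a few lines — a toy sketch (two tones 50 Hz apart at an assumed $f_s = 8$ kHz; `peak_count_near` is an ad-hoc helper, not a library routine). An 8 ms window merges the tones into one peak; a 128 ms window separates them:

```python
import numpy as np

fs = 8000
t = np.arange(0, 1.0, 1/fs)
x = np.sin(2*np.pi*1000*t) + np.sin(2*np.pi*1050*t)   # two tones 50 Hz apart

def peak_count_near(x, fs, win_len, f_lo=900.0, f_hi=1150.0):
    """Count resolved spectral peaks in [f_lo, f_hi] for one windowed frame."""
    seg = x[:win_len] * np.hanning(win_len)
    spec = np.abs(np.fft.rfft(seg, 8 * win_len))       # zero-pad -> smooth curve
    freqs = np.fft.rfftfreq(8 * win_len, 1/fs)
    m = spec[(freqs > f_lo) & (freqs < f_hi)]
    # crude peak picking: interior local maxima above half the in-band maximum
    peaks = (m[1:-1] > m[:-2]) & (m[1:-1] > m[2:]) & (m[1:-1] > 0.5 * m.max())
    return int(peaks.sum())

short = peak_count_near(x, fs, 64)     # 8 ms window: tones merge into one peak
long_ = peak_count_near(x, fs, 1024)   # 128 ms window: two distinct peaks
print(short, long_)
```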
2. Radar Waveform Design
Range resolution $\Delta R = c/(2B)$ ($B$ = bandwidth), velocity resolution $\Delta v = \lambda/(2T)$ ($T$ = pulse duration). For a simple (unmodulated) pulse, $BT \approx 1$: shortening the pulse to improve $\Delta R$ necessarily shrinks $T$ and worsens $\Delta v$, and vice versa.
Solution: Use chirp pulses, maintaining both large $B$ and large $T$ simultaneously, so that $BT \gg 1$.
3. Communications Pulse Shaping
OFDM symbol length $T_{sym}$ and subcarrier spacing $\Delta f_{sc}$ satisfy $T_{sym} \cdot \Delta f_{sc} = 1$ (a time-bandwidth product of exactly 1 — subcarrier orthogonality leaves no slack). 5G NR numerology (15/30/60/120 kHz) switches between different time-frequency tradeoffs.
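Since $T_{sym} = 1/\Delta f_{sc}$, the numerology table is one line of arithmetic — a quick sketch (cyclic prefix ignored; 5G NR defines subcarrier spacing as $15 \cdot 2^\mu$ kHz):

```python
# 5G NR numerology mu = 0..3 -> subcarrier spacing 15/30/60/120 kHz,
# useful symbol duration T_sym = 1 / spacing (cyclic prefix excluded)
numerology = {}
for mu in range(4):
    scs = 15e3 * 2**mu
    numerology[mu] = (scs, 1.0 / scs)
    print(f"mu={mu}: {scs/1e3:.0f} kHz -> {1e6/scs:.2f} us symbol")
```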
Applications
- 5G NR OFDM numerology: Subcarrier spacing 15 kHz → symbol length 66.7 µs (low-speed mobility); 120 kHz → 8.33 µs (mmWave high-speed scenarios). The tradeoff is determined by the uncertainty principle.
- X-band radar (FMCW): Bandwidth $B = 150$ MHz → $\Delta R = 1$ m. Coherent processing over $\approx 2$ ms (about 20 sweeps at $T_{PRI} = 100$ µs) → $\Delta v \approx 8$ m/s. $BT = 15000 \gg 1$ (chirp compression).
- Electroencephalography (EEG) time-frequency analysis: Analyzing alpha waves (8-13 Hz) requires $\Delta f < 5$ Hz → window length $> 200$ ms, limiting the ability to detect transient events.
Interactive: $\Delta t \cdot \Delta f$ Product of the Gaussian
Adjust the $\alpha$ parameter of the Gaussian $e^{-\alpha t^2}$. Observe: narrower in time → wider in frequency, but the product remains constant at $\frac{1}{4\pi}$ (equality).
Pitfalls & Limitations
- The uncertainty principle is a lower bound, not an equality: Only the Gaussian achieves equality. The $\Delta t \cdot \Delta f$ of a rectangular window is much larger than $1/(4\pi)$.
- Do not confuse with SNR: The uncertainty principle limits spread, not detection capability. At high SNR, you can still estimate frequency precisely (e.g., phase-locked loops).
- "Super-resolution" does not violate the uncertainty principle: Methods like MUSIC and ESPRIT exploit structural assumptions about the signal (e.g., sinusoidal models), bypassing the uncertainty principle's limitations. But the price is: if the assumption is wrong, the method collapses.
References: [1] Folland & Sitaram, The uncertainty principle: a mathematical survey, J. Fourier Anal. Appl., 1997. [2] Gabor, Theory of Communication, J. IEE, 1946. [3] Grochenig, Foundations of Time-Frequency Analysis, Birkhauser.
✅ Quick Check
Q1: If you double the STFT window length, how do the frequency and time resolutions change?
Show answer
Frequency resolution Δf is halved (improves), time resolution Δt is doubled (worsens). The product Δt·Δf remains unchanged.
Q2: What signal achieves the lower bound of the uncertainty principle?
Show answer
The Gaussian pulse e^{-αt²}; its FT is also a Gaussian. It is the only signal that achieves equality at Δt·Δf = 1/(4π).
2.1 Fourier Series
Frequency-domain representation of periodic signals — Complete theory
Why does this matter? Because the Fourier series is where everything begins. Power engineers analyzing THD, audio engineers analyzing timbre, communications engineers analyzing modulation — all start from "decomposing signals into harmonics." Without understanding Fourier series, every subsequent tool is a black box.
Previously... Part I established the mathematical foundation. Now we start building — beginning with the most fundamental Fourier series, analyzing what frequency components a periodic signal contains.
One-line summary: Any periodic signal can be decomposed into a superposition of sinusoids — these sinusoids have frequencies that are integer multiples of the fundamental frequency (harmonics).
Learning Objectives
- Derive coefficient formulas for the trigonometric and complex exponential forms
- Use symmetry (odd/even/half-wave) to simplify coefficient calculations
- Compute Fourier coefficients for square, triangle, and sawtooth waves
- Understand THD (Total Harmonic Distortion) analysis
The Problem: Harmonic Issues in Power Systems
The 60 Hz AC power from the utility grid is ideally a pure sinusoid. But in real systems:
- Nonlinear loads (rectifiers, variable-frequency drives, LED drivers) inject harmonic currents
- The 3rd harmonic (180 Hz) does not cancel in three-phase systems, accumulating in the neutral wire → neutral wire overheating
- IEEE 519-2022 standard specifies THD must not exceed 5% (voltage) or 8% (current)
To analyze these harmonics, you need the Fourier series. This is not abstract math — it's a practical problem power engineers face every day.
Historical context: Joseph Fourier (1768–1830) submitted a paper on heat conduction to the French Academy of Sciences in 1807, claiming that "any function can be expanded as a series of sinusoidal functions." Reviewer Lagrange strongly objected, believing discontinuous functions could not have such an expansion. The paper was rejected. But Fourier was right (at least in the $L^2$ sense). This controversy spawned the most profound advances in 19th-century analysis — from the Riemann integral to the Lebesgue integral, from pointwise convergence to $L^2$ convergence. Fourier eventually published Theorie analytique de la chaleur in 1822.
Principles: Two Forms
Intuition: Think of a periodic signal as a musical chord. A chord consists of a fundamental (fundamental frequency $f_0$) plus overtones ($2f_0, 3f_0, \ldots$). The Fourier series tells you how strong each overtone is and what its phase is.
Trigonometric Form:
$$f(t) = \frac{a_0}{2} + \sum_{n=1}^{\infty}\left[a_n\cos(n\omega_0 t) + b_n\sin(n\omega_0 t)\right]$$ $$a_n = \frac{2}{T}\int_0^T f(t)\cos(n\omega_0 t)\,dt, \quad b_n = \frac{2}{T}\int_0^T f(t)\sin(n\omega_0 t)\,dt$$Complex Exponential Form:
$$f(t) = \sum_{n=-\infty}^{\infty}c_n\,e^{jn\omega_0 t}, \quad c_n = \frac{1}{T}\int_0^T f(t)\,e^{-jn\omega_0 t}\,dt$$Relationships: $c_0 = a_0/2$, $c_n = (a_n - jb_n)/2$, $c_{-n} = \overline{c_n}$ (real signals)
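The coefficient integral can be checked numerically: for one uniformly sampled period, the Riemann sum for $c_n$ is exactly $\text{FFT}/N$. A minimal sketch (NumPy; the $\pm 1$ square wave and $N = 4096$ are illustrative choices), compared against the $c_n = 2/(jn\pi)$ result derived later in this section:

```python
import numpy as np

T, N = 1.0, 4096
t = np.arange(N) * T / N
square = np.where(t < T/2, 1.0, -1.0)       # +/-1 square wave, period T

# c_n = (1/T) * integral of f(t) e^{-j n w0 t} dt
#     ~ (1/N) * sum f(t_k) e^{-j 2 pi n k / N}  =  FFT / N
c = np.fft.fft(square) / N
print(c[1])          # ~ 2/(j*pi)  ~ -0.6366j
print(abs(c[2]))     # even harmonic: 0
print(c[3])          # ~ 2/(j*3*pi) ~ -0.2122j
```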
Symmetry Simplification
Using signal symmetry can greatly simplify coefficient calculations — you can even know certain coefficients are zero without integrating:
| Signal Symmetry | Result | Series Contains Only | Examples |
|---|---|---|---|
| Even function $f(-t)=f(t)$ | $b_n=0$ | Cosine only | Triangle wave, rectified sine |
| Odd function $f(-t)=-f(t)$ | $a_n=0$ | Sine only | Square wave (odd-symmetric), sawtooth wave |
| Half-wave symmetry $f(t+T/2)=-f(t)$ | Even harmonics are zero | Odd harmonics only | Square wave ($1,3,5,\ldots$ harmonics) |
Engineering quick reference: A square wave is both an odd function + half-wave symmetric → contains only odd-order sine harmonics. In power systems, a full-wave rectified waveform is an even function + no half-wave symmetry → contains all even-order cosine harmonics.
Classic Waveform Coefficient Derivations
Square Wave Coefficient Derivation
Square wave with period $T$ and amplitude $\pm 1$: $f(t) = 1$ for $0 < t < T/2$, $f(t) = -1$ for $T/2 < t < T$.
$$c_n = \frac{1}{T}\left[\int_0^{T/2}e^{-jn\omega_0 t}\,dt - \int_{T/2}^{T}e^{-jn\omega_0 t}\,dt\right]$$ $$= \frac{1}{T}\cdot\frac{1}{-jn\omega_0}\left[(e^{-jn\pi}-1) - (e^{-jn2\pi}-e^{-jn\pi})\right]$$ $$= \frac{1}{-jn\omega_0 T}\left[2e^{-jn\pi}-1-1\right] = \frac{2((-1)^n-1)}{-jn\cdot 2\pi}$$$n$ even → $c_n = 0$. $n$ odd → $c_n = \frac{2}{jn\pi}$.
Converting back to sine form: $b_n = 4/(n\pi)$ (odd $n$), $a_n = 0$.
$$\boxed{f(t) = \frac{4}{\pi}\sum_{n=1,3,5,\ldots}\frac{1}{n}\sin(n\omega_0 t)} \quad\blacksquare$$Triangle Wave Coefficient Derivation
The triangle wave is the integral of the square wave. Using the integration property: if $f(t)$ is the square wave with coefficients $c_n$, and $g(t) = \text{triangle wave} = \int f$, then the Fourier coefficients of $g$ are $d_n = c_n/(jn\omega_0)$.
$$d_n = \frac{2/(jn\pi)}{jn\omega_0} = \frac{-2}{n^2\pi\omega_0} \quad (\text{odd } n)$$Converting to standard form (triangle wave with amplitude 1):
$$\boxed{f(t) = \frac{8}{\pi^2}\sum_{n=1,3,5,\ldots}\frac{(-1)^{(n-1)/2}}{n^2}\sin(n\omega_0 t)} \quad\blacksquare$$Note that coefficients decay as $1/n^2$ (faster than the square wave's $1/n$), because the triangle wave is continuous.
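Both boxed series can be checked by partial summation — a sketch (NumPy; the arcsin-of-sin closed form for the unit triangle wave is a convenience, and 19 harmonics is an arbitrary cutoff). The $1/n^2$ series converges fast and uniformly, while the $1/n$ series keeps its Gibbs overshoot:

```python
import numpy as np

w0 = 2 * np.pi                       # fundamental for period T = 1
t = np.linspace(0, 1, 2001)

def square_partial(t, n_max):
    # boxed result: (4/pi) * sum over odd n of sin(n w0 t)/n
    return (4/np.pi) * sum(np.sin(n*w0*t)/n for n in range(1, n_max + 1, 2))

def triangle_partial(t, n_max):
    # boxed result: (8/pi^2) * sum over odd n of (-1)^((n-1)/2) sin(n w0 t)/n^2
    return (8/np.pi**2) * sum((-1)**((n - 1)//2) * np.sin(n*w0*t)/n**2
                              for n in range(1, n_max + 1, 2))

tri_exact = (2/np.pi) * np.arcsin(np.sin(w0 * t))   # unit-amplitude triangle wave

print(np.max(np.abs(triangle_partial(t, 19) - tri_exact)))  # small: 1/n^2 decay
print(np.max(square_partial(t, 19)))                        # ~1.09: Gibbs overshoot
```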
How to Use: THD Analysis Steps
Total Harmonic Distortion (THD) is the core metric for measuring waveform distortion:
$$\text{THD} = \frac{\sqrt{\sum_{n=2}^{\infty}|c_n|^2}}{|c_1|} \times 100\%$$
Practical steps:
- Use an ADC to sample one complete period (or average over multiple periods)
- Perform FFT to obtain the amplitude of each harmonic $|c_n|$
- Fundamental = $|c_1|$, harmonics = $|c_2|, |c_3|, \ldots$ (typically up to the 40th is sufficient)
- Substitute into the THD formula. IEEE 519 requires voltage THD $\leq 5\%$, individual harmonics $\leq 3\%$
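The steps above can be sketched end-to-end on a synthetic distorted waveform (the 7680 Hz "ADC rate" and the injected 5% 3rd / 3% 5th harmonics are made-up test values, not measured data):

```python
import numpy as np

fs, f0, cycles = 7680, 60, 10        # assumed ADC rate: 128 samples per 60 Hz cycle
N = fs // f0 * cycles                # integer number of periods -> no leakage
t = np.arange(N) / fs
v = (np.sin(2*np.pi*f0*t)
     + 0.05*np.sin(2*np.pi*3*f0*t)   # injected 5% 3rd harmonic (180 Hz)
     + 0.03*np.sin(2*np.pi*5*f0*t))  # injected 3% 5th harmonic (300 Hz)

spec = np.abs(np.fft.rfft(v)) / (N/2)      # single-sided amplitude spectrum
k1 = cycles                                # fundamental sits at bin = number of cycles
harm = spec[2*k1 : 41*k1 : k1]             # harmonics 2..40 at exact bins
thd = np.sqrt((harm**2).sum()) / spec[k1]
print(f"THD = {100*thd:.2f} %")            # sqrt(0.05^2 + 0.03^2) ~ 5.83 %
```

With integer periods the harmonics land on exact bins, so the recovered THD matches the injected distortion.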
Applications
- Power quality monitoring: In the 60 Hz system, a typical 6-pulse rectifier generates 5th, 7th, 11th, and 13th harmonics ($6k \pm 1$ pattern). The 3rd harmonic (180 Hz) accumulates in the three-phase neutral wire, potentially causing neutral current to reach $\sqrt{3}$ times the phase current.
- Timbre analysis: The spectrum of a piano A4 (440 Hz) contains a strong fundamental and gradually decaying harmonics; a violin at the same pitch has an entirely different harmonic structure — this is why the two instruments sound different.
- RF amplifier linearity: Nonlinearity in power amplifiers generates harmonics and intermodulation distortion (IMD). Second-order harmonics ($2f$) and third-order intermodulation ($2f_1 - f_2$) are the most critical design metrics.
Interactive: Harmonic Synthesis
Select a waveform and number of terms, and observe how the Fourier series progressively approximates the original waveform. Note the Gibbs ringing of the square wave.
Pitfalls & Common Misconceptions
- "Low THD means the waveform is clean" — Not necessarily. THD is an RMS metric and may mask a particularly strong high-order harmonic. You need to also examine individual harmonic amplitudes.
- "Fourier series only applies to periodic signals" — Strictly speaking, yes. But in practice, as long as your time window captures an integer number of periods, the DFT result is equivalent to the Fourier series. Non-integer periods → leakage problems.
- "The trigonometric form is more intuitive" — For beginners, yes, but the complex form is more convenient for derivations and computations. It's advisable to be familiar with converting between both forms.
References: [1] Oppenheim & Willsky, Signals and Systems, Ch.3. [2] Stein & Shakarchi, Fourier Analysis, Princeton. [3] IEEE 519-2022, Standard for Harmonic Control.
✅ Quick Check
Q1: Why does a square wave have only odd-order harmonics, and what symmetry is this related to?
Show answer
Half-wave symmetry f(t+T/2)=-f(t). Signals satisfying this symmetry contain only odd-order harmonics.
Q2: What is the frequency of the 3rd harmonic in a 60 Hz power system?
Show answer
60×3 = 180 Hz.
2.2 Continuous-Time Fourier Transform (CTFT)
From Fourier series to aperiodic signals — letting $T\to\infty$
Why does this matter? Because the CTFT extends Fourier analysis from periodic signals to all signals. "Filtering = frequency-domain multiplication" — this core theorem (convolution theorem) that makes the entire DSP field possible — is a property of the CTFT.
Previously... The Fourier series from 2.1 can only handle periodic signals. But in reality, most signals are not periodic (speech, transients, random signals). How do we generalize?
One-line summary: Generalizing the Fourier series to aperiodic signals — any signal with finite energy can be decomposed into continuous frequency components.
Learning Objectives
- Derive the CTFT from FS ($T\to\infty$ limiting process)
- Prove core properties: time shift, convolution, Parseval, etc.
- Derive rect $\leftrightarrow$ sinc and Gaussian $\leftrightarrow$ Gaussian transform pairs
- Understand the engineering significance of each property
The Problem: Real Signals Are Not Periodic
The Fourier series can only handle periodic signals, but in reality almost no signals are strictly periodic:
- Speech, music: transient signals with a beginning and an end
- Radar echoes: a pulse that comes and goes
- Seismic waves, brain waves: aperiodic, non-stationary
We need a more general tool that can handle any finite-energy signal. The CTFT is that tool.
Historical context: The concept of CTFT originates from an elegant limiting process: letting the period of the Fourier series $T \to \infty$. As $T$ increases, the spacing between discrete harmonic frequencies $n\omega_0 = 2\pi n/T$ shrinks ($\omega_0 \to 0$), and the discrete summation becomes a continuous integral. Fourier himself proposed this idea in 1822. Rigorous mathematical foundations were established by Plancherel (1910) and Wiener (1933).
Principles: From FS to CTFT
Intuition first: The Fourier series tells you "how strong the $n$-th harmonic is" (discrete frequencies). As the period → infinity, the spacing between harmonics → zero, and you no longer have discrete "frequency indices" but rather a continuous spectral density function $F(\omega)$.
Show full derivation: The $T \to \infty$ Limit
Periodic signal $f_T(t) = \sum_n c_n e^{jn\omega_0 t}$, $c_n = \frac{1}{T}\int_{-T/2}^{T/2}f(t)e^{-jn\omega_0 t}dt$.
Define $F_T(\omega) = Tc_n\big|_{\omega=n\omega_0} = \int_{-T/2}^{T/2}f(t)e^{-j\omega t}dt$ (pulling the $1/T$ factor out of $c_n$ so that $F_T$ stays finite as $T\to\infty$).
Then the original expansion becomes:
$$f_T(t) = \frac{1}{T}\sum_n F_T(n\omega_0)e^{jn\omega_0 t} = \frac{1}{2\pi}\sum_n F_T(n\omega_0)\,\underbrace{\omega_0}_{\Delta\omega}\, e^{jn\omega_0 t}$$Let $T\to\infty$: $\omega_0 = 2\pi/T \to d\omega$, $n\omega_0 \to \omega$ (continuous), discrete sum → Riemann integral:
$$\boxed{f(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty}F(\omega)\,e^{j\omega t}\,d\omega, \quad F(\omega) = \int_{-\infty}^{\infty}f(t)\,e^{-j\omega t}\,dt} \quad\blacksquare$$CTFT Pair (Fourier Transform Pair)
$$F(\omega) = \int_{-\infty}^{\infty}f(t)\,e^{-j\omega t}\,dt \quad \text{(Analysis / Forward Transform)}$$ $$f(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty}F(\omega)\,e^{j\omega t}\,d\omega \quad \text{(Synthesis / Inverse Transform)}$$Core Properties and Engineering Significance
| Property | Time Domain | Frequency Domain | Engineering Significance |
|---|---|---|---|
| Time Shift | $f(t-t_0)$ | $e^{-j\omega t_0}F(\omega)$ | Delay = phase rotation, amplitude unchanged |
| Frequency Shift | $e^{j\omega_0 t}f(t)$ | $F(\omega-\omega_0)$ | Modulation = shifting spectrum to carrier frequency |
| Scaling | $f(at)$ | $\frac{1}{|a|}F(\omega/a)$ | Time compression ↔ frequency expansion (uncertainty principle) |
| Convolution | $f*g$ | $F\cdot G$ | LTI filtering = frequency-domain multiplication (core of filter design) |
| Multiplication | $f\cdot g$ | $\frac{1}{2\pi}F*G$ | Window truncation = frequency-domain convolution (source of leakage) |
| Differentiation | $f'(t)$ | $j\omega F(\omega)$ | High-pass effect: high frequencies are amplified |
| Parseval | $\int|f|^2\,dt$ | $\frac{1}{2\pi}\int|F|^2\,d\omega$ | Time-domain energy = frequency-domain energy (energy conservation) |
Convolution Theorem Proof
$$\mathcal{F}\{f*g\}(\omega) = \int\left[\int f(\tau)g(t-\tau)\,d\tau\right]e^{-j\omega t}\,dt$$Swapping the order of integration and substituting $u = t - \tau$:
$$= \int f(\tau)e^{-j\omega\tau}\left[\int g(u)e^{-j\omega u}du\right]d\tau = F(\omega)\cdot G(\omega) \quad\blacksquare$$Engineering significance: The output of an LTI system $y = x * h$ becomes $Y = X \cdot H$ in the frequency domain — this is why we can describe filters using the frequency response $H(\omega)$. Multiplication is much simpler than convolution.
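A discrete, sampled sanity check of the theorem (NumPy; the random length-200 supports are arbitrary, chosen short enough that no wrap-around occurs — the circular-vs-linear distinction is treated in detail in the DFT section):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1024
f = np.zeros(N); f[:200] = rng.standard_normal(200)
g = np.zeros(N); g[:200] = rng.standard_normal(200)

# sampled convolution theorem: convolution in time == product in frequency
via_fft  = np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)).real
via_time = np.convolve(f, g)[:N]    # supports fit inside N -> circular == linear
err = np.max(np.abs(via_fft - via_time))
print(err)   # ~ 0 (floating-point roundoff)
```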
How to Use: Common Transform Pairs Quick Reference
| $f(t)$ | $F(\omega)$ | Memory Aid |
|---|---|---|
| $\text{rect}(t/\tau)$ | $\tau\,\text{sinc}(\omega\tau/2\pi)$ | Rectangular window ↔ sinc leakage |
| $e^{-\alpha|t|}$ | $\frac{2\alpha}{\alpha^2+\omega^2}$ | Exponential decay ↔ Lorentzian |
| $e^{-\alpha t^2}$ | $\sqrt{\pi/\alpha}\,e^{-\omega^2/(4\alpha)}$ | Gaussian ↔ Gaussian |
| $\delta(t)$ | $1$ | Impulse contains all frequencies |
| $1$ | $2\pi\delta(\omega)$ | DC ↔ zero-frequency delta |
| $e^{j\omega_0 t}$ | $2\pi\delta(\omega-\omega_0)$ | Pure tone ↔ spectral line |
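The Gaussian ↔ Gaussian row can be verified by brute-force numerical evaluation of the analysis integral — a sketch with illustrative grid choices ($\alpha = 1$):

```python
import numpy as np

# Riemann-sum approximation of F(w) = integral of f(t) e^{-j w t} dt, f(t) = e^{-t^2}
dt = 0.005
t = np.arange(-8, 8, dt)
w = np.linspace(-20, 20, 201)
f = np.exp(-t**2)
F_num = (np.exp(-1j * np.outer(w, t)) @ f) * dt     # discretized CTFT integral
F_theory = np.sqrt(np.pi) * np.exp(-w**2 / 4)       # sqrt(pi/alpha) e^{-w^2/(4 alpha)}
err_pair = np.max(np.abs(F_num - F_theory))
print(err_pair)   # tiny: the Gaussian transforms into a Gaussian
```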
Applications
- Filter design (convolution theorem): Want to design a 1 kHz lowpass filter? Draw an ideal rectangular passband $H(\omega)$ in the frequency domain, inverse FT to get impulse response $h(t) = \text{sinc}$ — then truncate with a window. FM radio IF filters (200 kHz bandwidth) are designed this way.
- AM modulation (frequency shift property): $x(t)\cos(\omega_c t) = \frac{1}{2}x(t)e^{j\omega_c t} + \frac{1}{2}x(t)e^{-j\omega_c t}$. Frequency domain: baseband spectrum is shifted to $\pm\omega_c$. AM broadcast uses carrier frequencies of 540-1600 kHz.
- Energy calculation (Parseval's theorem): Compute the energy of a UWB (ultra-wideband) pulse within the 3.1-10.6 GHz band: $E = \frac{1}{2\pi}\int_{2\pi\cdot3.1G}^{2\pi\cdot10.6G}|F(\omega)|^2\,d\omega$.
Interactive: rect $\leftrightarrow$ sinc
Adjust the width $\tau$ of the rectangular pulse. Observe: narrower pulse → wider sinc main lobe (a direct manifestation of the uncertainty principle).
Pitfalls & Limitations
- Where does $1/(2\pi)$ go? Different textbooks use different conventions. This platform uses the $\omega$-convention: forward transform has no $1/(2\pi)$, inverse transform has it. The $f$-convention has neither, but the exponent is $e^{-j2\pi ft}$. Don't mix conventions!
- The classical CTFT applies only to absolutely integrable or finite-energy ($L^1$ or $L^2$) signals: The CTFT of $\cos(\omega_0 t)$ requires distribution theory (see Section 1.2).
- The convolution theorem requires both functions to be integrable: For distributions or periodic signals, convolution must be defined more carefully.
References: [1] Oppenheim & Willsky, Signals and Systems, Ch.4. [2] Bracewell, The Fourier Transform and Its Applications, McGraw-Hill. [3] Papoulis, The Fourier Integral and Its Applications.
✅ Quick Check
Q1: What is the greatest engineering significance of the convolution theorem? State it in one sentence.
Show answer
Filtering = frequency-domain multiplication. By designing the shape of H(ω), you can selectively preserve or remove any frequency component.
Q2: What does a time-domain delay t₀ correspond to in the frequency domain?
Show answer
Multiplication by e^{-jωt₀} — magnitude unchanged, only phase rotation. This is why a linear-phase filter is equivalent to pure delay.
Interactive: Common CTFT Pairs
Choose a signal and observe its time- and frequency-domain representations. Try different widths to understand the time–frequency reciprocity.
2.3 Discrete-Time Fourier Transform (DTFT)
The bridge connecting the continuous and discrete worlds
Why does this matter? Because the DTFT is the bridge connecting the analog world (CTFT) and the digital world (DFT). Without understanding that "sampling causes spectral periodization," you cannot truly understand why aliasing occurs or what DFT results represent.
Previously... The CTFT from 2.2 is a tool for the continuous world. But computers can only process discrete number sequences. After sampling an analog signal into a digital signal, how does the spectrum change?
One-line summary: After sampling an analog signal into a digital signal, its spectrum becomes periodically repeated — the DTFT is the tool that describes this post-sampling world.
Learning Objectives
- Define the DTFT and understand the $2\pi$ periodicity of its frequency domain
- Derive the DTFT from CTFT + sampling (spectral periodization)
- Distinguish between DTFT (continuous frequency) and DFT (discrete frequency sampling)
- Understand the true role of zero-padding
The Problem: How Does the Spectrum Change After Sampling?
You used an ADC to digitize an analog signal at sampling rate $f_s = 48$ kHz. Now you have a sequence of numbers $x[0], x[1], x[2], \ldots$.
- What is the "spectrum" of this sequence? How does it relate to the original analog signal's spectrum?
- Why is the FFT output periodic with period $f_s$?
- Are the DTFT and DFT the same thing? If not, what's the difference?
The DTFT is the theoretical foundation for understanding digital signal processing. DFT/FFT is its practical computational version.
Historical context: The concept of DTFT was formalized alongside the rise of digital computation. In the 1960s, with the proliferation of A/D converters and digital computers, engineers needed a complete "discrete world" Fourier theory. The DTFT filled the theoretical gap between CTFT (purely continuous) and DFT (purely discrete, finite-length), and was systematically organized by Oppenheim, Schafer, and others in 1970s textbooks.
Principles: Definition and Periodicity
Intuition first: Sampling is like viewing a spinning wheel with a strobe light. If the strobe frequency is not high enough, the wheel appears to "spin backward" — this is because the spectrum undergoes periodic repetition (aliasing). The DTFT precisely describes this phenomenon.
DTFT Definition
$$X(e^{j\omega}) = \sum_{n=-\infty}^{\infty}x[n]\,e^{-j\omega n}$$$X(e^{j\omega})$ is a continuous function of $\omega$, periodic with period $2\pi$.
Inverse transform:
$$x[n] = \frac{1}{2\pi}\int_{-\pi}^{\pi}X(e^{j\omega})\,e^{j\omega n}\,d\omega$$
Why Is It $2\pi$-Periodic?
Because $e^{-j(\omega+2\pi)n} = e^{-j\omega n}\cdot e^{-j2\pi n} = e^{-j\omega n}\cdot 1 = e^{-j\omega n}$. Discrete sampling inherently cannot distinguish frequency $\omega$ from $\omega + 2\pi$ — this is the mathematical root of aliasing.
💡 Intuition: Why must discrete-time frequency be periodic?
Consider two discrete complex exponentials $e^{j\omega n}$ and $e^{j(\omega+2\pi)n}$:
$$e^{j(\omega+2\pi)n} = e^{j\omega n}\cdot e^{j2\pi n} = e^{j\omega n}\cdot 1 = e^{j\omega n}$$Because $e^{j2\pi n} = 1$ for every integer $n$.
Conclusion: Frequencies $\omega$ and $\omega+2\pi$ are indistinguishable in discrete time — they produce exactly the same sample values. So the DTFT's frequency axis is naturally $2\pi$-periodic.
This also explains why the Nyquist limit exists: when an analog signal's frequency exceeds $f_s/2$, this $2\pi$ wrap-around causes it to be "aliased" back into the low-frequency region.
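The indistinguishability is easy to demonstrate: sampling $\cos$ at $f$, $f + f_s$, and $f_s - f$ produces numerically identical sequences. A minimal sketch (parameters are illustrative):

```python
import numpy as np

fs = 1000
n = np.arange(64)
x_100  = np.cos(2*np.pi * 100  * n / fs)   # 100 Hz sampled at fs = 1 kHz
x_1100 = np.cos(2*np.pi * 1100 * n / fs)   # 1100 Hz = 100 Hz + fs
x_900  = np.cos(2*np.pi * 900  * n / fs)   # 900 Hz = fs - 100 Hz (cos is even)

# all three analog frequencies yield the same samples
print(np.max(np.abs(x_100 - x_1100)), np.max(np.abs(x_100 - x_900)))
```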
Relationship with CTFT: Sampling → Periodization
If $x[n] = x_c(nT_s)$ (sampling), then:
$$X(e^{j\omega}) = \frac{1}{T_s}\sum_{k=-\infty}^{\infty}X_c\!\left(\frac{\omega - 2\pi k}{T_s}\right)$$
Sampling causes spectral periodization. If the bandwidth of $X_c$ exceeds $\pi/T_s$ (the Nyquist frequency), adjacent copies overlap → aliasing, which is irreversible.
DTFT vs DFT: Key Differences
The DFT is a uniform sampling of the DTFT on the frequency axis:
| Property | DTFT | DFT |
|---|---|---|
| Input | Infinite-length sequence $x[n]$ | Finite-length $N$-point sequence |
| Output | Continuous function $X(e^{j\omega})$ | $N$ discrete values $X[k]$ |
| Frequency resolution | Continuous (infinite resolution) | $\Delta f = f_s/N$ |
| Computability | Theoretical tool | FFT enables fast computation |
Key insight: The DTFT gives the complete continuous spectrum; the DFT merely takes $N$ equally spaced samples from it. Zero-padding increases the DFT's sampling density (revealing more detail of the DTFT), but does not change the DTFT itself — zero-padding does not improve frequency resolution, it only improves the "display resolution" of the spectrum.
How to Use: Understanding DFT Results Through the DTFT
- Understand the meaning of DFT bins: $X[k]$ is a sample of the DTFT $X(e^{j\omega})$ at $\omega = 2\pi k/N$. The corresponding physical frequency is $f_k = k \cdot f_s / N$.
- Decide whether zero-padding is needed: If fine details of the DTFT between two peaks are missed by the DFT sampling, increase $N$ (via zero-padding or by collecting more data).
- Distinguish "true resolution" from "display resolution": True resolution is determined by the data length ($\Delta f = f_s / N_{data}$). Zero-padding only improves interpolation, not resolution.
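The "DFT = sampled DTFT" statement can be seen directly: zero-padding a 64-point sequence to 512 points gives 8× denser samples of the same DTFT, and every original bin reappears exactly. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(64)

X64  = np.fft.fft(x)          # 64 samples of the DTFT X(e^{jw})
X512 = np.fft.fft(x, 512)     # zero-pad to 512: denser samples of the SAME DTFT

# every original bin reappears exactly among the zero-padded bins (k -> 8k)
pad_err = np.max(np.abs(X64 - X512[::8]))
print(pad_err)   # ~ 0: zero-padding adds display density, not information
```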
Applications
- FIR filter frequency response: The frequency response of an FIR filter $h[n]$ (finite length $M$) is simply its DTFT: $H(e^{j\omega}) = \sum_{n=0}^{M-1}h[n]e^{-j\omega n}$. Using the DFT ($N \gg M$, with zero-padding) lets you plot the frequency response curve at high density.
- Frequency resolution in spectrum analysis: Analyzing audio sampled at $f_s = 48$ kHz and wanting $\Delta f = 1$ Hz resolution requires $N = f_s/\Delta f = 48000$ points, i.e., at least 1 second of data.
- CIC filter droop analysis: The DTFT of a CIC (Cascaded Integrator-Comb) filter is $|H(e^{j\omega})| = |\frac{\sin(M\omega/2)}{M\sin(\omega/2)}|^K$. The DTFT enables precise analysis of passband droop.
Interactive: DTFT vs DFT
Blue solid line = DTFT (approximated by a high-density DFT), red dots = $N$-point DFT samples. Increasing $N$ reveals more detail of the DTFT — but the DTFT itself does not change.
Pitfalls & Common Misconceptions
- "Zero-padding improves resolution" — This is the most common misconception. Zero-padding lets the DFT take more samples of the DTFT (a smoother spectral curve), but the DTFT itself is entirely determined by the original data. You cannot create new information from zeros.
- Units of $\omega$: In the DTFT, $\omega$ is normalized angular frequency (radians/sample), ranging over $[-\pi, \pi]$. The corresponding physical frequency is $f = \omega f_s/(2\pi)$. $\omega = \pi$ corresponds to the Nyquist frequency $f_s/2$.
- DTFT existence: $x[n]$ must be absolutely summable ($\sum|x[n]| < \infty$) or at least square-summable. The DTFT of an infinite-length periodic sequence (e.g., $\cos(\omega_0 n)$) requires distribution theory ($\delta$ functions appear in the frequency domain).
References: [1] Oppenheim & Schafer, Discrete-Time Signal Processing, Ch.2-5. [2] Proakis & Manolakis, Digital Signal Processing, Ch.4.
✅ Quick Check
Q1: What is the key difference between the DTFT and the DFT?
Show answer
The DTFT has a continuous frequency axis (ω takes all values), while the DFT takes N equally spaced samples on the frequency axis. DFT = sampled DTFT.
Q2: Why does sampling cause spectral periodization?
Show answer
Sampling = multiplication by an impulse train; in the frequency domain this becomes convolution with an impulse train = periodic repetition of the spectrum.
2.4 DFT & FFT
Discrete Fourier Transform & Fast Algorithm
Why does this matter? Because DFT/FFT is the computation you actually run on a computer — all the preceding theory ultimately lands through the FFT. Understanding circular vs. linear convolution and the true role of zero-padding is key to avoiding FFT misuse.
Previously... The DTFT from 2.3 gives a continuous spectrum — but computers cannot store continuous functions. We need to discretize the frequency axis as well, and that is the DFT. The FFT then makes the DFT fast enough for real-time computation.
One-line summary: The DFT is the version a computer can actually compute; the FFT is the algorithm that makes it fast — reducing complexity from $O(N^2)$ to $O(N\log N)$.
Learning Objectives
- DFT matrix perspective: $\mathbf{X} = \mathbf{W}_N\mathbf{x}$, unitary property
- Understand the difference between circular convolution and linear convolution
- Cooley-Tukey radix-2 divide-and-conquer derivation
- Choose $N$ correctly; understand the role and misconceptions of zero-padding
The Problem: The World Before 1965
Imagine the era before the FFT:
| $N$ | Direct DFT (multiplications) | FFT (multiplications) | Speedup |
|---|---|---|---|
| 1,024 | 1,048,576 | 5,120 | 205x |
| 4,096 | 16,777,216 | 24,576 | 683x |
| 1,048,576 | $1.1 \times 10^{12}$ | $10,485,760$ | 104,858x |
Before 1965, a single 1024-point spectrum analysis took several minutes on the computers of the day. The invention of the FFT reduced that same computation to milliseconds, directly enabling all modern digital signal processing applications.
Historical context: James Cooley and John Tukey published their landmark paper An algorithm for the machine calculation of complex Fourier series in 1965. The backdrop was Cold War nuclear test monitoring: the US needed to analyze seismic station data to detect Soviet underground nuclear tests. The massive demand for Fourier analysis gave birth to the FFT. Interestingly, Carl Friedrich Gauss had invented a similar algorithm as early as 1805 (for computing asteroid orbits), but his manuscript was not discovered until 1866 and, being written in Latin, was long overlooked.
Principles: DFT Matrix Perspective
Intuition first: The DFT multiplies an $N$-dimensional vector (time-domain signal) by a special $N \times N$ matrix (the twiddle-factor matrix) to produce another $N$-dimensional vector (frequency domain).
$$\mathbf{X} = \mathbf{W}_N\,\mathbf{x}, \qquad (\mathbf{W}_N)_{kn} = W_N^{kn} = e^{-j2\pi kn/N}, \quad k, n = 0, \ldots, N-1$$$\frac{1}{\sqrt{N}}\mathbf{W}_N$ is a unitary matrix: $\mathbf{W}_N^H\mathbf{W}_N = N\mathbf{I}$
Inverse DFT: $x[n] = \frac{1}{N}\sum_{k=0}^{N-1}X[k]\,W_N^{-kn}$, i.e., $\mathbf{x} = \frac{1}{N}\mathbf{W}_N^H\mathbf{X}$.
Expand derivation: Conjugate symmetry of the DFT for real signals
Theorem: If $x[n]$ is a real-valued sequence, its DFT satisfies $X[N-k] = X^*[k]$ (conjugate symmetry).
Derivation:
$$X[N-k] = \sum_{n=0}^{N-1} x[n]\, e^{-j2\pi(N-k)n/N}$$ $$= \sum_{n=0}^{N-1} x[n]\, e^{-j2\pi n}\, e^{j2\pi kn/N}$$Since $e^{-j2\pi n} = 1$ (for any integer $n$):
$$= \sum_{n=0}^{N-1} x[n]\, e^{j2\pi kn/N}$$Because $x[n]$ is real, $x[n] = x^*[n]$:
$$= \sum_{n=0}^{N-1} x^*[n]\, e^{j2\pi kn/N} = \left(\sum_{n=0}^{N-1} x[n]\, e^{-j2\pi kn/N}\right)^* = X^*[k] \quad\blacksquare$$Practical implications:
- The DFT of a real signal is fully determined by the first half $X[0], X[1], \ldots, X[N/2]$
- Computation and storage can be cut in half (real-FFT algorithms)
- $X[0]$ is real (DC), and, for even $N$, $X[N/2]$ is also real (Nyquist bin)
- $|X[k]| = |X[N-k]|$ (magnitude spectrum is symmetric); $\angle X[k] = -\angle X[N-k]$ (phase spectrum is antisymmetric)
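The symmetry is easy to verify numerically. A quick sketch (the sequence length and random seed are arbitrary choices of this example):

```python
import numpy as np

x = np.random.default_rng(1).standard_normal(16)     # any real-valued sequence
X = np.fft.fft(x)

# X[N-k] = X*[k]: reversing X[1:] must equal the conjugate of X[1:]
print(np.allclose(X[1:][::-1], np.conj(X[1:])))      # True

# DC (k=0) and Nyquist (k=N/2) bins are real up to rounding error
print(abs(X[0].imag) < 1e-9, abs(X[8].imag) < 1e-9)  # True True
```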
Circular Convolution vs Linear Convolution
The DFT corresponds to circular convolution (the tail wraps around to the head), not linear convolution.
| Type | Result length (for $x$ of length $M$, $y$ of length $L$) | Required DFT size $N$ |
|---|---|---|
| Linear convolution | $M + L - 1$ | $N \geq M + L - 1$ |
| Circular convolution | $N$ | any $N \geq \max(M, L)$, but the result wraps around (aliases) unless $N \geq M + L - 1$ |
Key point: To compute linear convolution via the DFT, you must zero-pad to $N \geq M + L - 1$. Otherwise the circular convolution's "wrap-around" will corrupt the result. This is the theoretical foundation of the overlap-add (OLA) and overlap-save (OLS) methods.
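The zero-padding rule can be sketched in a few lines; the function name and toy sequences below are illustrative, not from the text:

```python
import numpy as np

def fft_linear_convolve(x, y):
    """Linear convolution via FFT: pad both inputs to N >= M + L - 1."""
    n_out = len(x) + len(y) - 1             # linear-convolution length M + L - 1
    nfft = 1 << (n_out - 1).bit_length()    # next power of two >= n_out
    X = np.fft.rfft(x, nfft)                # rfft zero-pads to nfft internally
    Y = np.fft.rfft(y, nfft)
    return np.fft.irfft(X * Y, nfft)[:n_out]

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, -1.0])
print(fft_linear_convolve(x, y))            # matches np.convolve(x, y)
print(np.convolve(x, y))                    # [ 1.  1.  1. -3.]
```

Overlap-add and overlap-save apply exactly this idea block by block on long signals.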
Radix-2 FFT: Divide and Conquer
Cooley-Tukey Radix-2 Divide-and-Conquer Derivation
Assume $N = 2^m$. Split the DFT into even-indexed and odd-indexed terms:
$$X[k] = \sum_{n=0}^{N-1}x[n]W_N^{kn} = \underbrace{\sum_{r=0}^{N/2-1}x[2r]W_N^{2rk}}_{A[k]} + W_N^k\underbrace{\sum_{r=0}^{N/2-1}x[2r+1]W_N^{2rk}}_{B[k]}$$Note that $W_N^{2rk} = W_{N/2}^{rk}$ (since $e^{-j2\pi\cdot 2r k/N} = e^{-j2\pi rk/(N/2)}$).
So $A[k]$ and $B[k]$ are each $N/2$-point DFTs!
$$X[k] = A[k] + W_N^k\,B[k], \quad k = 0, 1, \ldots, N/2-1$$ $$X[k+N/2] = A[k] - W_N^k\,B[k] \quad (\text{since } W_N^{k+N/2} = -W_N^k)$$This is the butterfly operation.
Complexity: $N/2$ butterflies per stage, $\log_2 N$ stages total $\to$ $O(N\log N)$. $\;\blacksquare$
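The derivation translates almost line for line into a recursive implementation. This is a teaching sketch, not an optimized FFT (real libraries use iterative, in-place butterflies with bit-reversed indexing):

```python
import numpy as np

def fft_radix2(x):
    """Recursive Cooley-Tukey radix-2 DIT FFT; len(x) must be a power of 2."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    if N == 1:
        return x
    A = fft_radix2(x[0::2])                          # even-indexed half -> A[k]
    B = fft_radix2(x[1::2])                          # odd-indexed half  -> B[k]
    W = np.exp(-2j * np.pi * np.arange(N // 2) / N)  # twiddle factors W_N^k
    return np.concatenate([A + W * B, A - W * B])    # the butterfly

x = np.random.default_rng(0).standard_normal(1024)
print(np.allclose(fft_radix2(x), np.fft.fft(x)))     # True
```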
How to Use: Choosing $N$ and Frequency Mapping
Step 1: Choose $N$
Need $\Delta f = 1$ Hz with $f_s = 48000$ Hz $\to$ $N = 48000$. FFT is most efficient when $N = 2^m$, so choose $N = 2^{16} = 65536$ ($\Delta f \approx 0.73$ Hz).
Step 2: Map FFT bins to physical frequencies
$k = 0$: DC, $k = N/2$: Nyquist, $k > N/2$: negative frequencies ($f_k - f_s$)
Step 3: Correctly understanding zero-padding
| What zero-padding does | What it does NOT do |
|---|---|
| Increases DFT sampling density (smoother spectral curve) | Improve true frequency resolution |
| Makes $N$ a power of 2 (most efficient FFT) | Increase the information content of the signal |
| Prevents circular convolution aliasing ($N \geq M+L-1$) | Reduce noise |
Python Example: Compute FFT and Plot Spectrum
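The example this heading promises did not survive extraction; a minimal version might look like the following (all signal parameters are illustrative; uncomment the matplotlib lines to plot):

```python
import numpy as np
# import matplotlib.pyplot as plt

fs = 1000.0                        # sampling rate [Hz]
N = 2048                           # FFT size, power of 2
t = np.arange(N) / fs
x = 1.0*np.sin(2*np.pi*50*t) + 0.3*np.sin(2*np.pi*120*t)

w = np.hanning(N)                  # window to tame leakage (see 3.1)
X = np.fft.rfft(x * w)
f = np.fft.rfftfreq(N, d=1/fs)     # bin k -> physical frequency k*fs/N
amp = 2 * np.abs(X) / w.sum()      # single-sided amplitude, window gain removed

peak = f[np.argmax(amp)]
print(f"dominant component near {peak:.1f} Hz")
# plt.plot(f, amp); plt.xlabel("Hz"); plt.show()
```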
Applications
- Audio spectrum analyzer: $f_s = 44100$ Hz, $N = 4096$ $\to$ $\Delta f = 10.8$ Hz. Sufficient to resolve adjacent piano keys (e.g., A4 = 440 Hz vs. A#4 = 466 Hz, a difference of 26 Hz).
- OFDM communications: 802.11ax (Wi-Fi 6) uses a 1024-point FFT with subcarrier spacing of 78.125 kHz. Each FFT bin corresponds to one subcarrier.
- Real-time spectrum analyzer: The Keysight RSA uses an $N = 2^{22}$ FFT, achieving $\Delta f \approx 26$ Hz resolution over a 110 MHz bandwidth, computed hundreds of times per second.
Pitfalls & Common Errors
- Getting FFT bin frequencies wrong: The most common mistake is treating $k = N-1$ as the highest frequency. In fact, $k > N/2$ corresponds to negative frequencies. For real-valued input, you only need $k = 0, \ldots, N/2$.
- Zero-padding $\neq$ improving resolution: Zero-padding 100 data points to 1024 points still gives a resolution of $f_s/100$, not $f_s/1024$.
- Circular convolution $\neq$ linear convolution: Forgetting to zero-pad and directly using FFT for convolution produces wrap-around artifacts.
- Normalization conventions: Different FFT libraries (FFTW, NumPy, MATLAB) use different scaling conventions. Some divide by $N$ in the forward transform, others in the inverse. Verify by checking whether Parseval's identity holds.
References: [1] Cooley & Tukey, An algorithm for the machine calculation of complex Fourier series, Math. Comp., 1965. [2] Oppenheim & Schafer, Discrete-Time Signal Processing, Ch.8-9. [3] Van Loan, Computational Frameworks for the FFT, SIAM.
📝 Worked Example
You have a 0.5-second audio clip sampled at 8000 Hz. (a) How many sample points total? (b) What is the FFT frequency resolution Δf? (c) If you need to see frequency details down to 0.5 Hz, how long must the observation time be?
Show solution
(a) N = 0.5 × 8000 = 4000 points
(b) Δf = fs/N = 8000/4000 = 2 Hz
(c) Δf = 1/T → T = 1/0.5 = 2 seconds (zero-padding cannot substitute for this)
✅ Quick Check
Q1: How much faster is a 1024-point FFT compared to a direct DFT?
Show answer
DFT: N² = 1,048,576 multiplications. FFT: (N/2)log₂N = 5,120. Speedup ≈ 205x.
Q2: Does zero-padding to 4096 points improve the true frequency resolution?
Show answer
No. True resolution depends only on the observation time: Δf = 1/T. Zero-padding only makes the frequency axis denser (interpolation) and does not add new information.
Interactive: The True Effect of Zero-Padding
Signal: 50 Hz + 55 Hz, sampling rate 500 Hz, observation time 0.1 s (50 points). Zero-padding will not separate the two peaks — it only makes the spectrum smoother.
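The same experiment can be run offline. The helper count_peaks below is a crude peak counter of this sketch, not part of the platform:

```python
import numpy as np

fs = 500.0
t = np.arange(50) / fs                   # 0.1 s of data -> true resolution 1/T = 10 Hz
x = np.sin(2*np.pi*50*t) + np.sin(2*np.pi*55*t)

def count_peaks(s):
    """Local maxima above half the global maximum — a crude peak counter."""
    mid = s[1:-1]
    hits = (mid > s[:-2]) & (mid > s[2:]) & (mid > 0.5 * s.max())
    return int(np.count_nonzero(hits))

padded = np.abs(np.fft.rfft(x, 8192))    # heavy zero-padding: smoother, still ONE hump

t2 = np.arange(500) / fs                 # 1 s of data -> 1 Hz resolution
x2 = np.sin(2*np.pi*50*t2) + np.sin(2*np.pi*55*t2)
resolved = np.abs(np.fft.rfft(x2, 8192)) # longer observation DOES separate the tones

print(count_peaks(padded), count_peaks(resolved))   # 1 2
```

Only the longer observation time resolves the 5 Hz spacing; padding merely interpolates.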
Decimation-in-Time vs Decimation-in-Frequency
The Cooley-Tukey FFT has two equivalent but distinct decompositions. This platform demonstrates DIT (Decimation-in-Time).
DIT — Decimation in Time
Split the input sequence into even and odd halves:
$$X[k] = \sum_{r=0}^{N/2-1}x[2r]W_{N/2}^{rk} + W_N^k\sum_{r=0}^{N/2-1}x[2r+1]W_{N/2}^{rk}$$- Input requires bit-reversal ordering
- Output is in natural order
- Butterfly structure: multiply then add
DIF — Decimation in Frequency
Split the output into even and odd groups:
$$X[2k] = \sum_{n=0}^{N/2-1}\left[x[n]+x[n+N/2]\right]W_{N/2}^{nk}$$ $$X[2k+1] = \sum_{n=0}^{N/2-1}\left[x[n]-x[n+N/2]\right]W_N^n W_{N/2}^{nk}$$- Input is in natural order
- Output requires bit-reversal ordering
- Butterfly structure: add then multiply
Practical choice:
- Both have exactly the same operation count: $\frac{N}{2}\log_2 N$ complex multiplications
- Most modern FFT libraries (FFTW, MKL) support both
- The choice is usually dictated by "how upstream/downstream stages order their data," to avoid extra bit-reversal
- DIT is more intuitive (from the butterfly-diagram perspective); DIF is its transpose
⚠ Common FFT Implementation Pitfalls
These are the traps engineers most frequently fall into when actually using numpy.fft.fft or scipy.fft.
Pitfall 1: Forgetting Normalization
numpy's fft performs no normalization by default. For an N-point amplitude spectrum, multiply by 2/N (single-sided):
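The code this pitfall refers to is missing; a sketch with illustrative values (note that bins $k=0$ and $k=N/2$ should not get the factor 2):

```python
import numpy as np

fs, N = 1000, 1000
t = np.arange(N) / fs
x = 3.0 * np.sin(2 * np.pi * 100 * t)   # amplitude 3, exactly on a bin

X = np.fft.rfft(x)
amp = 2 * np.abs(X) / N                 # single-sided amplitude: multiply by 2/N
print(amp[100])                         # ~3.0: recovers the true amplitude
```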
Pitfall 2: Misusing fftfreq
The physical frequency corresponding to index k is $f_k = k \cdot f_s / N$, not $k$ itself:
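A sketch of the mapping (values are illustrative):

```python
import numpy as np

fs, N = 8000, 512
f = np.fft.fftfreq(N, d=1/fs)   # f[k] = k*fs/N for k < N/2, then the negative half
print(f[1])                     # fs/N = 15.625 Hz — NOT 1 Hz
print(f[N//2])                  # -fs/2: the second half are negative frequencies
```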
Pitfall 3: Spectral Leakage
If the signal frequency is not an integer multiple of $f_s/N$ (i.e., not centered in a bin), energy leaks into neighboring bins. Always apply a window:
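A sketch comparing far-out leakage with and without a window (the 150 Hz cutoff for "far from the tone" is an arbitrary choice of this example):

```python
import numpy as np

fs, N = 1000, 1000
t = np.arange(N) / fs
x = np.sin(2 * np.pi * 100.5 * t)    # 100.5 Hz falls between bins -> worst-case leakage

rect = np.abs(np.fft.rfft(x))                   # no window
hann = np.abs(np.fft.rfft(x * np.hanning(N)))   # Hann window

# energy far from the tone (> 150 Hz here) is pure leakage
leak_rect = rect[150:].max() / rect.max()
leak_hann = hann[150:].max() / hann.max()
print(leak_rect, leak_hann)          # Hann leakage is orders of magnitude lower
```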
Pitfall 4: Zero-Padding ≠ Higher Resolution
Zero-padding only makes DFT bins denser (an interpolation effect) — it does not increase the true frequency resolution. True resolution is determined by the observation time $T$: $\Delta f = 1/T$.
Pitfall 5: DC Drift
If the signal has a DC offset, the FFT will show a huge peak at k=0 that may mask low-frequency components you care about. Remove the mean first:
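A sketch of the effect (offset and tone values are illustrative):

```python
import numpy as np

fs, N = 1000, 1000
t = np.arange(N) / fs
x = 5.0 + 0.01 * np.sin(2 * np.pi * 2 * t)    # large DC offset + tiny 2 Hz component

raw = np.abs(np.fft.rfft(x))
detrended = np.abs(np.fft.rfft(x - x.mean()))  # remove the mean first

print(np.argmax(raw), np.argmax(detrended))    # 0 2 — the DC peak masked the 2 Hz line
```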
Pitfall 6: Complex vs Real FFT
For real-valued signals, use np.fft.rfft() — it's roughly 2x faster than fft() (only computes half) thanks to conjugate symmetry.
2.5 The Z-Transform
A unified analysis framework for discrete-time systems
Why does this matter? Because the Z-transform is the universal tool for discrete-system analysis. Determining whether a filter is stable, reading the frequency response from a pole-zero plot, designing IIR filters — all rely on the Z-transform. Its role in discrete systems is equivalent to that of the Laplace transform in continuous systems.
Previously... The DFT/FFT from 2.4 is a computational tool. But to analyze discrete-system stability and design digital filters, we need a more powerful framework — the Z-transform.
One-line summary: The Z-transform is the "universal tool" for digital systems — stability, frequency response, and filter design all depend on it.
Learning Objectives
- Define the Z-transform and its ROC (Region of Convergence)
- Understand that DFT = Z-transform sampled on the unit circle
- Use pole-zero analysis to determine stability and frequency response
- Design simple IIR filters
The Problem: Core Questions About Digital Filters
You have designed a digital filter whose difference equation is:
- Is it stable? (Will the output blow up?)
- What does its frequency response look like? (Which frequencies are amplified, which are attenuated?)
- If I want to completely eliminate 1 kHz (notch), how should I modify it?
These questions are hard to answer using the time-domain difference equation. The Z-transform turns the difference equation into an algebraic equation, making everything clear.
Historical context: The Z-transform is the discrete counterpart of the Laplace transform. In continuous systems, the Laplace transform converts differential equations into algebraic equations, with the variable $s$ living in the complex plane. The Z-transform does the same thing for discrete systems: the variable $z$ also lives in the complex plane, and $z = e^{sT_s}$ maps the $s$-plane to the $z$-plane. Lotfi Zadeh (later famous for fuzzy logic) and John Ragazzini introduced the modern form of the Z-transform in 1952.
Principles: Definition and Region of Convergence
Intuition first: The DTFT evaluates a sequence on the frequency axis ($e^{j\omega}$, the unit circle). The Z-transform extends this evaluation from the unit circle to the entire complex plane. This extra degree of freedom lets us analyze stability, causality, and other properties that the DTFT alone cannot directly reveal.
Z-transform Definition
$$X(z) = \sum_{n=-\infty}^{\infty}x[n]\,z^{-n}, \quad z \in \mathbb{C}$$When $z = e^{j\omega}$ (unit circle): $X(e^{j\omega}) = \text{DTFT}$
When $z = e^{j2\pi k/N}$ ($N$ equally spaced points on the unit circle): $X[k] = \text{DFT}$
Region of Convergence (ROC)
The set of $z$ values for which $\sum|x[n]||z|^{-n}$ converges. The shape of the ROC determines the nature of the sequence:
| Sequence Type | ROC Shape | Example |
|---|---|---|
| Finite-length (FIR) | Entire $z$-plane (possibly excluding $z=0$ or $z=\infty$) | $x[n] = \delta[n] - 0.5\delta[n-1]$ |
| Causal right-sided sequence | Exterior of a circle: $|z| > r_{\max}$ | $x[n] = a^n u[n]$, ROC: $|z|>|a|$ |
| Anti-causal left-sided sequence | Interior of a circle: $|z| < r_{\min}$ | $x[n] = -a^n u[-n-1]$, ROC: $|z|<|a|$ |
Stability criterion: A causal LTI system is stable $\iff$ its ROC contains the unit circle $\iff$ all poles lie inside the unit circle ($|p_i| < 1$). This is the sole criterion for digital filter stability.
Pole-Zero Analysis
For a rational transfer function (IIR filter):
$q_k$: zeros, $p_k$: poles
Intuition for the frequency response: $H(e^{j\omega})$ is obtained by walking once around the unit circle and looking at how far each position is from the poles and zeros.
- Pole close to the unit circle → that frequency is amplified (resonance). The closer the pole, the sharper the peak.
- Zero close to the unit circle → that frequency is attenuated (notch). A zero on the unit circle = complete elimination.
- Pole outside the unit circle → the system is unstable.
Why do poles cause resonance?
Near $\omega = \theta_p$ (the pole angle):
$$|H(e^{j\omega})| \approx \frac{|b_0|\prod|e^{j\omega}-q_k|}{\prod_{k\neq i}|e^{j\omega}-p_k| \cdot |e^{j\omega}-p_i|}$$When $\omega \approx \theta_p$, $|e^{j\omega} - p_i| = |e^{j\omega} - |p_i|e^{j\theta_p}| \approx 1 - |p_i|$ (small).
Therefore $|H| \approx \frac{C}{1-|p_i|}$. As $|p_i| \to 1$, the gain tends to infinity.
The 3-dB bandwidth of the peak is $\approx 2(1-|p_i|)$ radians. $\;\blacksquare$
How to Use: Designing Filters from Pole-Zero Plots
- Notch filter (eliminating a specific frequency): Place zeros at $z = e^{j\omega_0}$ and $z = e^{-j\omega_0}$, paired with nearby poles at $z = r\,e^{\pm j\omega_0}$ ($r < 1$, e.g., $r = 0.95$) to control the notch width.
- Resonator (enhancing a specific frequency): Place poles at $z = r\,e^{\pm j\omega_0}$; the closer $r$ is to 1, the sharper the resonance. $Q$ factor $\approx \frac{\omega_0}{2(1-r)}$.
- Stability check: Are all poles $|p_k| < 1$? If any pole lies on or outside the unit circle, the system is unstable. Use MATLAB's zplane(b,a) or Python's scipy.signal.tf2zpk.
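The notch recipe plus the stability check can be sketched as follows (notch frequency, sampling rate, and pole radius are illustrative; scipy is assumed available):

```python
import numpy as np
from scipy import signal

fs = 8000.0
f0 = 1000.0                          # frequency to eliminate [Hz]
w0 = 2 * np.pi * f0 / fs             # its angle on the unit circle
r = 0.95                             # pole radius: closer to 1 = narrower notch

b = np.poly([np.exp(1j*w0), np.exp(-1j*w0)]).real      # zeros ON the unit circle
a = np.poly([r*np.exp(1j*w0), r*np.exp(-1j*w0)]).real  # poles just inside, same angle

z, p, k = signal.tf2zpk(b, a)
print("stable:", bool(np.all(np.abs(p) < 1)))          # True

w, h = signal.freqz(b, a, worN=4096, fs=fs)            # response on the unit circle
gain_at_f0 = np.abs(h[np.argmin(np.abs(w - f0))])
print("gain at 1 kHz:", gain_at_f0)                    # ~0: complete notch
```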
Applications
- IIR filter design: Butterworth, Chebyshev, and elliptic filters are designed with poles and zeros placed in the $s$-plane, then mapped to the $z$-plane via the bilinear transform $s = \frac{2}{T_s}\frac{1-z^{-1}}{1+z^{-1}}$. For example, a 5th-order Butterworth lowpass has 5 poles uniformly distributed on the left half of the $s$-plane circle.
- Control-system stability: The closed-loop transfer function $H_{cl}(z)$ of a digital PID controller must have all poles inside the unit circle. If some pole has $|p| = 0.98$ (very close to but still inside), the system is stable but will exhibit slowly decaying oscillations.
- Audio equalizer (EQ): Each band of a parametric EQ is a pair of conjugate poles plus a pair of conjugate zeros. The center frequency is set by the pole-zero angle, the $Q$ factor by the radius, and the gain by the pole-to-zero distance ratio.
Interactive: Pole-Zero Plot & Frequency Response
Adjust the pole position (conjugate pair $p = re^{\pm j\theta}$) and observe how the frequency response (= Z-transform evaluated on the unit circle) changes. The closer the radius is to 1, the sharper the resonance peak.
Pitfalls & Limitations
- The ROC must be specified: The same algebraic expression $X(z)$ can correspond to different time-domain sequences depending on the ROC. For example, $X(z) = 1/(1-az^{-1})$: ROC $|z|>|a|$ $\to$ causal exponential $a^n u[n]$; ROC $|z|<|a|$ $\to$ anti-causal $-a^n u[-n-1]$.
- FIR is always stable: FIR filters have no feedback poles (all poles are at $z=0$), so they are unconditionally stable. This is the biggest advantage of FIR over IIR.
- Numerical precision: The closer the poles are to the unit circle, the more sensitive an IIR filter is to coefficient quantization. In 16-bit fixed-point implementations, poles with $|p| > 0.99$ may cause limit-cycle oscillations.
References: [1] Oppenheim & Schafer, Discrete-Time Signal Processing, Ch.3-6. [2] Proakis & Manolakis, Digital Signal Processing, Ch.3. [3] Mitra, Digital Signal Processing: A Computer-Based Approach.
✅ Quick Check
Q1: How do you tell from a pole-zero plot whether a frequency is amplified or attenuated?
Show answer
Pole close to some frequency on the unit circle → that frequency is amplified (resonance); zero close → that frequency is attenuated (notch).
Q2: What is the stability condition for an IIR filter?
Show answer
All poles lie inside the unit circle (|z|<1), equivalently the ROC contains the unit circle.
2.6 The Sampling Theorem
Rigorously deriving Shannon's theorem from the Poisson summation formula
Why does this matter? Because sampling is the first step from analog to digital, and aliasing caused by a wrong sampling rate is an irreversible disaster. Why does CD audio use 44.1 kHz? Why does vibration analysis use 2.56x? Why does 5G oversample? All are direct applications of the Nyquist theorem.
Previously... So far we have assumed we already have a discrete sequence x[n]. But x[n] is obtained by sampling an analog signal x(t) — how high must the sampling rate be to avoid losing information?
One-line summary: If the sampling rate is not high enough, high frequencies disguise themselves as low frequencies — this is called aliasing, and it is irreversible.
Learning Objectives
- Derive sampling = frequency-domain periodization (Poisson summation formula)
- Derive the Nyquist condition $f_s \geq 2f_{\max}$ from periodization
- Derive the sinc reconstruction (Whittaker-Shannon interpolation) formula
- Understand the design of anti-aliasing filters in practical systems
The Problem: A Recording Engineer's Nightmare
Suppose you record a sound containing a 5000 Hz tone at $f_s = 8000$ Hz. On playback, what you hear is not 5000 Hz but 3000 Hz (= $8000 - 5000$).
- 5000 Hz exceeds the Nyquist frequency $f_s/2 = 4000$ Hz
- It is "folded" (aliased) to $f_s - 5000 = 3000$ Hz
- Irreversible: from the recorded data alone, you cannot distinguish between 3000 Hz and 5000 Hz
This is why every ADC must be preceded by an anti-aliasing filter. The sampling theorem tells you where to set the cutoff frequency of that filter.
Historical context: Harry Nyquist first proposed in his 1928 paper Certain Topics in Telegraph Transmission Theory that sampling at rate $f_s$ can represent signals with bandwidth at most $f_s/2$. Claude Shannon gave the full mathematical proof and the sinc reconstruction formula in his 1949 foundational information-theory paper. In Russia, V. A. Kotelnikov independently obtained the same result in 1933. This theorem is therefore sometimes called the Nyquist-Shannon-Kotelnikov sampling theorem.
Principles: Rigorous Derivation
Intuition first: Sampling is like stamping with a comb along the frequency axis — every $f_s$ Hz it prints a copy of the original spectrum. If the original spectrum is too wide, adjacent copies overlap (aliasing), just as stamps printed too densely blur the image.
From Poisson summation to Shannon's theorem
Step 1: Sampling = multiplication by an impulse train
$$x_s(t) = x(t)\cdot\sum_{n=-\infty}^{\infty}\delta(t - nT_s) = \sum_n x(nT_s)\,\delta(t-nT_s)$$Step 2: Frequency domain
The FT of an impulse train is another impulse train: $\mathcal{F}\{\sum_n\delta(t-nT_s)\} = \frac{2\pi}{T_s}\sum_k\delta(\omega-k\omega_s)$, where $\omega_s = 2\pi f_s$.
Time-domain multiplication = frequency-domain convolution:
$$X_s(\omega) = \frac{1}{2\pi}X(\omega) * \frac{2\pi}{T_s}\sum_k\delta(\omega-k\omega_s) = \frac{1}{T_s}\sum_{k=-\infty}^{\infty}X(\omega - k\omega_s)$$Sampling causes periodic repetition of the spectrum with spacing $\omega_s = 2\pi f_s$.
Step 3: Nyquist condition
If $X(\omega) = 0$ for $|\omega| > \omega_{\max}$ (band-limited signal) and $\omega_s > 2\omega_{\max}$, adjacent copies do not overlap. The original spectrum can be recovered using an ideal lowpass filter with gain $T_s$ and cutoff frequency $\omega_s/2$:
$$\boxed{f_s \geq 2f_{\max} \quad \text{(Nyquist Rate)}}$$Step 4: Sinc reconstruction (Whittaker-Shannon interpolation)
The impulse response of an ideal lowpass filter is the sinc function. Therefore:
$$x(t) = \sum_{n=-\infty}^{\infty}x(nT_s)\,\text{sinc}\!\left(\frac{t-nT_s}{T_s}\right)$$where $\text{sinc}(u) = \frac{\sin(\pi u)}{\pi u}$. Each sample "grows" a sinc waveform, and the superposition of all sincs exactly reconstructs the original continuous signal. $\;\blacksquare$
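The interpolation formula can be checked numerically with a truncated sum (the tone frequency, window of samples, and test instant are illustrative; truncation makes the result near-exact rather than exact):

```python
import numpy as np

fs = 100.0                                   # sampling rate, Ts = 0.01 s
Ts = 1 / fs
n = np.arange(-500, 500)                     # a "long enough" block of samples
x_n = np.sin(2 * np.pi * 13 * n * Ts)        # 13 Hz tone, well below fs/2

def sinc_reconstruct(t, samples, n, Ts):
    """Truncated Whittaker-Shannon sum: x(t) ~ sum x[n] sinc((t - n*Ts)/Ts)."""
    # np.sinc(u) = sin(pi*u)/(pi*u), matching the formula above
    return np.sum(samples * np.sinc((t - n * Ts) / Ts))

t0 = 0.0042                                  # an off-grid time instant
exact = np.sin(2 * np.pi * 13 * t0)
approx = sinc_reconstruct(t0, x_n, n, Ts)
print(abs(exact - approx))                   # tiny: reconstruction is near-exact
```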
How to Use: Sampling Rate Design for Practical Systems
In theory $f_s \geq 2f_{\max}$ is enough. But in practice you need more, because:
- Anti-aliasing filters are not ideal: An ideal brick-wall lowpass filter is unrealizable. Real analog LPFs have a transition band and you need margin.
- Rules of thumb: $f_s \geq 2.56 \times f_{\max}$ (industry standard for vibration analysis), or the more conservative $f_s \geq 3{-}4 \times f_{\max}$.
- Place the anti-aliasing filter before the ADC (in the analog domain), with cutoff set at $f_s/2$ or slightly below.
| Application | $f_{\max}$ | $f_s$ (practical) | Ratio | Notes |
|---|---|---|---|---|
| CD audio | 20 kHz | 44.1 kHz | 2.2x | Human hearing limit 20 kHz |
| Professional audio | 20 kHz | 96 kHz | 4.8x | Simplifies AAF design |
| Vibration analysis | $f_{max}$ | $2.56 \times f_{max}$ | 2.56x | ISO/IEC standard |
| 5G baseband | 100 MHz | 245.76 MHz | 2.46x | 3GPP standard sampling rate |
| Sigma-Delta ADC | $f_b$ | $64{-}256 \times f_b$ | 64-256x | Oversampling trades for bit depth |
Applications
- Audio CD (44.1 kHz): The maximum perceivable frequency for the human ear is about 20 kHz. $44100/20000 = 2.205$. Why 44.1 instead of 40? To leave transition-band margin for the anti-aliasing filter. The historical reason for 44100 is related to NTSC video format.
- Mechanical vibration analysis (2.56x): Monitoring turbine bearing fault frequencies. If the highest frequency of interest is 10 kHz, the sampling rate is 25.6 kHz, using an 8th-order Butterworth AAF with cutoff at 10 kHz. $2.56 = 2 \times 1.28$; the 1.28 margin lets the AAF attenuate by more than 60 dB at Nyquist.
- Oversampling ADCs: Sigma-Delta ADCs sample at an extremely high rate (e.g., $256 \times f_b$), then use a digital decimation filter to bring the rate down to the target. The benefit: the anti-aliasing filter can be very simple (a single-pole RC is enough), because the transition band width is approximately $255 \times f_b$.
Interactive: Aliasing Demo
An 80 Hz sinusoid. Adjust the sampling rate $f_s$. When $f_s < 160$ Hz (below Nyquist), observe aliasing — the sampled signal looks like a different frequency.
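The demo's folding behavior can be predicted and verified offline; apparent_frequency is a helper of this sketch, not platform code:

```python
import numpy as np

def apparent_frequency(f0, fs):
    """Frequency at which an f0-Hz tone appears after sampling at fs (folding)."""
    return abs(f0 - round(f0 / fs) * fs)    # fold into [0, fs/2]

for rate in (500, 200, 150, 100):           # the demo's 80 Hz tone
    print(rate, apparent_frequency(80, rate))   # 80, 80, 70, 20

# empirical check at fs = 100 Hz: the FFT peak sits at the alias, not at 80 Hz
fs, N = 100, 1000
x = np.sin(2 * np.pi * 80 * np.arange(N) / fs)
f = np.fft.rfftfreq(N, 1 / fs)
peak = f[np.argmax(np.abs(np.fft.rfft(x)))]
print(peak)                                  # ~20.0 Hz (aliased), not 80
```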
Pitfalls & Limitations
- "If the sampling rate is high enough, no anti-aliasing filter is needed" — Wrong. Any signal plus environmental noise has theoretically infinite bandwidth. Without an AAF, high-frequency noise aliases in.
- "Sinc reconstruction gives exact recovery" — In theory yes, but the sinc function is infinitely long and unrealizable. Real systems use finite-length approximations (e.g., Lanczos kernel, polynomial interpolation).
- Bandpass sampling: If the signal is bandpass (e.g., an RF signal in $f_c \pm B/2$), the sampling rate only needs $f_s \geq 2B$ (not $2f_c$). But the choice of $f_s$ must avoid overlap between spectral copies, requiring more careful calculation.
- Aliasing is irreversible: Once aliasing has occurred, no post-processing (digital filtering, AI, etc.) can recover the original signal. Prevention is the only option.
References: [1] Shannon, Communication in the Presence of Noise, Proc. IRE, 1949. [2] Nyquist, Certain Topics in Telegraph Transmission Theory, Trans. AIEE, 1928. [3] Oppenheim & Schafer, Discrete-Time Signal Processing, Ch.4.
📝 Worked Example
A signal contains 100 Hz, 250 Hz, and 500 Hz components. (a) What is the minimum sampling rate? (b) If fs=800 Hz, is there aliasing? (c) If fs=900 Hz, where does the 500 Hz component alias to?
Show solution
(a) fs ≥ 2×500 = 1000 Hz
(b) fs=800 < 1000, so the 500 Hz component will alias. 250 Hz is unaffected (800 ≥ 2×250 = 500).
(c) 500 Hz aliases to |500−900| = 400 Hz
✅ Quick Check
Q1: Why is the CD sampling rate 44.1 kHz?
Show answer
The human ear hears up to ~20 kHz, so Nyquist requires ≥40 kHz. 44.1 kHz provides ~10% margin for the anti-aliasing filter's transition band.
Q2: If you forgot the anti-aliasing filter before sampling, can you fix it afterwards?
Show answer
No. Aliasing is irreversible — high frequencies have already disguised themselves as low frequencies, and there is no way to distinguish genuine low frequencies from the aliased ones.
2.7 Transform Relationships Overview
The complete relationship chain of FS / CTFT / DTFT / DFT / Z-Transform
Why does this matter? Because FS, CTFT, DTFT, DFT, and the Z-transform are not five independent tools but five perspectives on the same story. Once you understand their relationships, you will never get lost on the question "which transform should I use?"
Previously... We have learned five transforms (FS, CTFT, DTFT, DFT, Z). They look like different tools, but there are precise mathematical relationships between them — once you understand the relationship map, you will stop confusing them.
One-line summary: FS, CTFT, DTFT, DFT, Z-transform — not five independent tools, but five perspectives on the same story.
Learning Objectives
- Understand the derivation relationships among the five transforms
- Master the duality "sampling → periodization" and "truncation → discretization"
- Choose the right transform tool based on signal characteristics
Relationship Chain: Five Branches of the Same Tree
Intuition first: All Fourier transforms do the same thing — decompose a signal into frequency components. The difference lies in whether the signal is continuous/discrete, periodic/aperiodic. Two core operations connect them:
Sampling $\to$ periodization in the frequency domain
Truncation / periodization $\to$ discretization in the frequency domain
Transform Relationship Diagram
- FS (continuous periodic → discrete spectrum) $\xrightarrow{\;T \to \infty\;}$ CTFT (continuous aperiodic → continuous spectrum): de-periodization
- CTFT $\xrightarrow{\;\text{sampling}\;}$ DTFT (discrete aperiodic → continuous periodic spectrum): frequency-domain periodization
- Z-Transform (discrete → $z$-plane) $\xrightarrow{\;z = e^{j\omega}\;}$ DTFT: evaluation on the unit circle
- DTFT $\xrightarrow{\;\text{truncate to } N \text{ points}\;}$ DFT (discrete periodic → discrete periodic spectrum): frequency-domain discretization
Complete Comparison Table
| Transform | Time Domain | Frequency Domain | Connecting Operation | Formula |
|---|---|---|---|---|
| FS | continuous, period $T$ | discrete ($c_n$) | — | $c_n = \frac{1}{T}\int_0^T f(t)e^{-jn\omega_0 t}dt$ |
| CTFT | continuous, aperiodic | continuous | $T\to\infty$ (FS → CTFT) | $F(\omega) = \int f(t)e^{-j\omega t}dt$ |
| DTFT | discrete, aperiodic | continuous, $2\pi$-periodic | sampling (CTFT → DTFT) | $X(e^{j\omega}) = \sum x[n]e^{-j\omega n}$ |
| DFT | discrete, $N$-periodic | discrete, $N$-periodic | truncate to $N$ points (DTFT → DFT) | $X[k] = \sum_{n=0}^{N-1}x[n]e^{-j2\pi kn/N}$ |
| Z-Transform | discrete | function on the $z$-plane | $z=e^{j\omega}$ → DTFT | $X(z) = \sum x[n]z^{-n}$ |
Key Insight: Duality Principle
Sampling ↔ periodization: Time-domain sampling (multiplication by an impulse train) causes frequency-domain periodization (spectral repetition). The converse also holds: time-domain periodization causes frequency-domain discretization (becoming discrete $c_n$).
Truncation ↔ discretization: Time-domain truncation (multiplication by a finite window) is equivalent to frequency-domain convolution (convolution with the window's FT, causing leakage). At the same time, the DTFT of a finite-length $N$ sequence can be fully represented by $N$ equally spaced samples — this is the DFT.
The DFT is the result of "double operations": both sampling (discrete in time domain) and truncation (finite length in time domain), so both time and frequency domains are discrete and periodic. This is why the DFT is the only version a computer can compute — because computers can only handle finitely many discrete numbers.
How to Use: Choosing the Right Transform Tool
| Your signal / need | Use | Reason |
|---|---|---|
| Continuous periodic signal (e.g., 60 Hz mains) | FS | Discrete harmonic structure, THD analysis |
| Continuous transient signal (theoretical analysis) | CTFT | Property derivations, filter theory |
| Theoretical spectrum of a discrete signal | DTFT | FIR/IIR frequency-response analysis |
| Finite-length discrete data (actual computation) | DFT/FFT | Only version a computer can compute |
| System stability / pole-zero analysis | Z-Transform | ROC determines stability, poles determine resonance |
| $s$-domain analysis of continuous systems | Laplace | Continuous counterpart of Z |
Practical rules of thumb:
- Theoretical derivations $\to$ CTFT / DTFT / Z (pen-and-paper tools)
- Writing code to compute spectra $\to$ DFT/FFT (the only computable version)
- Analyzing filter stability $\to$ poles of the Z-transform
- Interpreting the physical meaning of FFT results $\to$ back to DTFT and CTFT theory
"Translation" Examples Between Transforms
The following shows how the same physical problem moves between different transforms:
Scenario: Designing a digital lowpass filter
- CTFT: Define the ideal lowpass $H(\omega) = \text{rect}(\omega/2\omega_c)$, inverse FT gives $h(t) = \text{sinc}$
- Sampling: $h[n] = h(nT_s) = \text{sinc}(nT_s \omega_c/\pi)$ $\to$ enters the discrete world (DTFT)
- Truncation: $h_w[n] = h[n] \cdot w[n]$ (window), length $M$ $\to$ finite-length FIR (DFT)
- Z-transform: $H(z) = \sum_{n=0}^{M-1}h_w[n]z^{-n}$ (all zeros, no poles → FIR is always stable)
- FFT: Use an $N$-point FFT to compute $H(e^{j2\pi k/N})$ and verify the frequency response
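The five steps above can be sketched end to end (cutoff, filter length, and the band edges used for the check are illustrative choices):

```python
import numpy as np

fs = 8000.0
fc = 1000.0                                # cutoff [Hz]
M = 101                                    # odd length -> symmetric taps, linear phase

# Steps 1-2: sample the ideal sinc impulse response, centered on the middle tap
n = np.arange(M) - (M - 1) / 2
h = (2 * fc / fs) * np.sinc(2 * fc / fs * n)

# Step 3: truncate smoothly with a window
h_w = h * np.hamming(M)

# Steps 4-5: H(z) is all-zero (FIR, always stable);
# check the response on the unit circle with an N-point FFT
H = np.abs(np.fft.rfft(h_w, 4096))
f = np.fft.rfftfreq(4096, 1 / fs)
print("passband gain ~", H[f < 500].mean())   # ~1
print("stopband peak  ~", H[f > 2000].max())  # small (Hamming: roughly -53 dB)
```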
Common Confusions
- DTFT $\neq$ DFT: The DTFT is a continuous frequency function (not directly computable); the DFT is its $N$-point sampling.
- Z-transform $\neq$ DTFT: The DTFT is a special case of the Z-transform on the unit circle. The ROC information of the Z-transform is "invisible" in the DTFT.
- FS $\neq$ DFT: FS is a continuous-time theory, while DFT is a finite-length discrete computation. But taking the DFT of a periodic signal over an integer number of periods yields a result equal to the FS coefficients multiplied by $N$.
References: [1] Oppenheim & Willsky, Signals and Systems, Ch.3-5. [2] Oppenheim & Schafer, Discrete-Time Signal Processing, Ch.2-8. [3] Haykin & Van Veen, Signals and Systems.
Interactive: Transform Relationship Diagram
Click a node or hover over it to see the derivation relationships among the transforms.
3.1 Periodogram & Window Functions
The root cause of spectral leakage and the engineering trade-offs of windows
Why does this matter? Because computing an FFT without windowing is like looking at the spectrum through a pair of severely astigmatic glasses — the sidelobes from leakage will completely swamp weak signals. Proper use of windows is the first step toward reliable spectral analysis.
Previously: Part II gave us the transform tools. But directly feeding a signal into the FFT to compute its spectrum (periodogram) produces results severely distorted by truncation effects. Window functions are the first step in solving this problem.
Learning Objectives
- Derive that truncation = frequency-domain convolution, and understand the cause of leakage
- Compare performance metrics of Rectangular / Hann / Hamming / Blackman / Kaiser windows
- Choose the optimal window based on resolution vs. dynamic range requirements
- Understand the statistical properties of the periodogram as a PSD estimator
One-Sentence Summary
The periodogram is simply "feed the signal into an FFT and square the result" — simple but crude. Window functions are the key to making it less crude.
Pain Point: "Ghost Artifacts" in the Spectrum
You analyze a signal with the FFT — it contains only a single 100 Hz sine wave, yet the spectrum sprouts a bunch of ghost artifacts at 80, 90, 110, 120 Hz and beyond. This is spectral leakage.
A more serious scenario: you want to detect a weak 1.05 kHz signal next to a strong 1 kHz signal (e.g., harmonic distortion analysis), but the leakage sidelobes completely swamp the weak signal. This is not a hardware problem — it is a mathematical inevitability.
Origin
Arthur Schuster (1898) first proposed the concept of the periodogram for analyzing the periodicity of sunspot activity. He applied Fourier analysis directly to the observed data, computing the "intensity" at each frequency — a simple and intuitive idea, but with poor statistical properties (not fully understood until the mid-20th century).
Systematic study of window functions came with Fredric J. Harris (1978) and his classic paper "On the Use of Windows for Harmonic Analysis with the Discrete Fourier Transform". Harris systematically compared the frequency-domain characteristics of over 20 window functions and established engineering criteria for window selection. This paper remains one of the most cited in the field (Google Scholar citations exceeding 10,000).
Jim Kaiser at Bell Labs developed the Bessel-function-based Kaiser window, whose unique feature is that a single parameter $\beta$ continuously adjusts the trade-off between mainlobe width and sidelobe attenuation — turning window selection from "pick one from a pile of fixed windows" into "turn a knob to your desired balance point."
Principle: Why Does Truncation Cause Leakage?
Intuition: You only observe a finite duration of the signal. Mathematically, this is equivalent to multiplying an infinitely long signal by a rectangular function (1 inside the observation window, 0 outside). Multiplication in time = convolution in frequency. The spectrum of the rectangular function is a sinc (with infinitely many sidelobes), so the originally clean spectral line gets "smeared" by the sinc sidelobes — this is leakage.
Truncation = multiply by rectangular window = convolve with sinc in frequency
$$x_w[n] = x[n] \cdot w[n] \;\longleftrightarrow\; X_w(e^{j\omega}) = \frac{1}{2\pi}\,X(e^{j\omega}) * W(e^{j\omega})$$The DTFT of the rectangular window is the Dirichlet kernel (discrete sinc):
$$W_{\text{rect}}(e^{j\omega}) = e^{-j\omega(N-1)/2}\,\frac{\sin(\omega N/2)}{\sin(\omega/2)}$$
Its first sidelobe is only -13 dB below the mainlobe. This means: if there is a strong signal, the energy it leaks into adjacent frequencies is only 13 dB below itself — this will completely swamp any nearby signal that is more than 13 dB weaker.
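A quick numerical sketch of this effect (assuming numpy): when the record holds an exact integer number of periods, the spectrum is a single line; shift the tone off a bin center and the sinc skirts light up a large fraction of the bins.

```python
import numpy as np

# Sketch: leakage appears as soon as the record holds a non-integer number
# of periods. Count how many bins rise above -40 dB of the peak in each case.
fs, N = 1000.0, 256
n = np.arange(N)
cases = {
    "integer periods": np.sin(2 * np.pi * (32 * fs / N) * n / fs),  # 32 cycles
    "non-integer":     np.sin(2 * np.pi * 100.0 * n / fs),          # 25.6 cycles
}
frac = {}
for name, x in cases.items():
    X = np.abs(np.fft.rfft(x))
    X /= X.max()
    frac[name] = np.mean(20 * np.log10(X + 1e-16) > -40.0)
    print(f"{name:16s}: {frac[name]:5.1%} of bins above -40 dB")
```

The integer-period case shows one isolated line (everything else is at numerical noise); the 25.6-cycle case spreads energy across a sizable fraction of the spectrum.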
Expand: Why can other windows improve sidelobes?
All windows share the same core design philosophy: sacrifice mainlobe width to gain sidelobe attenuation.
The problem with the rectangular window is that it abruptly truncates to zero at the edges, causing the Gibbs phenomenon in the frequency domain (slow sidelobe decay). If the window function smoothly tapers to zero at the edges (like the cosine shape of the Hann window), the sidelobes in the frequency domain decay much faster.
Mathematically, the Hann window can be expressed as a linear combination of three rectangular windows:
$$w_{\text{Hann}}[n] = 0.5\,w_{\text{rect}}[n] - 0.25\,w_{\text{rect}}[n]\,e^{j2\pi n/(N-1)} - 0.25\,w_{\text{rect}}[n]\,e^{-j2\pi n/(N-1)}$$Therefore, the DTFT of the Hann window is a superposition of three shifted Dirichlet kernels. In the sidelobe region, these three terms approximately cancel each other (destructive interference), causing the sidelobes to drop rapidly.
The cost is that the mainlobe widens from 2 bins to 4 bins — the minimum distance between two resolvable frequencies doubles.
More generally, the higher the order of continuous derivatives at the window edges, the faster the sidelobe decay rate:
$$\text{At the edges, } w^{(k)}(0) = w^{(k)}(N-1) = 0 \text{ for } k = 0, 1, \ldots, m \implies \text{sidelobes} \sim O(\omega^{-(m+2)})$$$\blacksquare$
Periodogram: Definition and Statistical Properties
The simplest spectral estimate: take the squared magnitude of the windowed DFT,
$$\hat{S}(f) = \frac{1}{NU}\left|\sum_{n=0}^{N-1}w[n]\,x[n]\,e^{-j2\pi fn/f_s}\right|^2, \qquad U = \frac{1}{N}\sum_{n}|w[n]|^2$$
Expand: Proof that the periodogram is an inconsistent estimator
Expected value of the periodogram:
$$E[\hat{S}(\omega)] = \frac{1}{2\pi}S(\omega) * |W(\omega)|^2$$This is the convolution of the true PSD $S(\omega)$ with the window power spectrum — biased, but as $N \to \infty$, $|W|^2$ approaches a delta function and the bias vanishes.
However, the variance is where the problem lies. It can be shown (using Bartlett's formula) that:
$$\text{Var}[\hat{S}(\omega)] \approx S^2(\omega) \quad (\text{does not decrease with } N\text{!})$$That is, the relative standard deviation is $\approx 100\%$, regardless of how large $N$ is. Increasing $N$ only lets you see equally violent random fluctuations on a finer frequency grid.
This is why the periodogram is an "inconsistent estimator" — Welch's method or the Multitaper method is needed to reduce the variance. $\;\blacksquare$
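The inconsistency is easy to verify empirically (a Monte-Carlo sketch, assuming numpy): for white Gaussian noise the true PSD is flat, yet the periodogram's relative spread stays near 100% no matter how large $N$ gets.

```python
import numpy as np

# Monte-Carlo sketch of the inconsistency proof above: the periodogram value
# at a fixed bin is ~chi-squared with 2 DOF, so its std/mean ratio stays ~1.
rng = np.random.default_rng(0)
rel = {}
for N in (256, 4096):
    trials = np.array([np.abs(np.fft.fft(rng.standard_normal(N))) ** 2 / N
                       for _ in range(2000)])
    vals = trials[:, N // 4]               # one interior frequency bin
    rel[N] = vals.std() / vals.mean()      # ≈ 1 regardless of N
    print(f"N = {N:4d}: relative std of periodogram ≈ {rel[N]:.2f}")
```

Growing $N$ by a factor of 16 leaves the relative fluctuation unchanged — exactly the behavior the derivation predicts.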
Window Function Formulas
Rectangular: $w[n] = 1, \quad 0 \leq n \leq N-1$
Hann: $w[n] = 0.5\!\left(1 - \cos\frac{2\pi n}{N-1}\right)$
Hamming: $w[n] = 0.54 - 0.46\cos\frac{2\pi n}{N-1}$
Blackman: $w[n] = 0.42 - 0.5\cos\frac{2\pi n}{N-1} + 0.08\cos\frac{4\pi n}{N-1}$
Blackman-Harris (4-term): $w[n] = 0.35875 - 0.48829\cos\frac{2\pi n}{N-1} + 0.14128\cos\frac{4\pi n}{N-1} - 0.01168\cos\frac{6\pi n}{N-1}$
Flat-top: Coefficients designed to make the mainlobe top flat, sacrificing frequency resolution for amplitude accuracy ($< 0.01$ dB error)
Kaiser: $w[n] = \frac{I_0\!\left(\beta\sqrt{1 - \left(\frac{2n}{N-1} - 1\right)^2}\right)}{I_0(\beta)}$, where $I_0$ is the modified Bessel function
Comparison of Five Windows
| Window | Mainlobe Width (bins) | Highest Sidelobe (dB) | Sidelobe Decay Rate | ENBW (bins) | Typical Use |
|---|---|---|---|---|---|
| Rectangular | 2 | -13 | -6 dB/oct | 1.00 | Transient analysis, resolution priority |
| Hann | 4 | -31 | -18 dB/oct | 1.50 | General-purpose default, audio analysis |
| Hamming | 4 | -42 | -6 dB/oct | 1.36 | Speech analysis, FIR design |
| Blackman | 6 | -58 | -18 dB/oct | 1.73 | High dynamic range, radar sidelobe suppression |
| Kaiser ($\beta$=6) | ~5 | -46 | Adjustable | ~1.5 | Adjustable trade-off, filter design |
| Flat-top | ~10 | -44 | -6 dB/oct | 3.77 | Calibration, precise amplitude measurement |
ENBW (Equivalent Noise Bandwidth): how much white-noise power the window admits relative to an ideal one-bin filter. The rectangular window achieves the minimum, ENBW = 1.0 bin; every other window has ENBW > 1, meaning each spectral bin collects noise from a wider effective bandwidth. When converting a windowed spectrum into a PSD, divide by the ENBW to correct the noise floor.
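The ENBW column of the table can be checked directly from the window samples (a sketch, assuming scipy): for samples $w[n]$, ENBW in bins is $N\sum w^2 / (\sum w)^2$.

```python
import numpy as np
from scipy.signal import get_window

# Sketch: compute ENBW (in bins) as N * sum(w^2) / (sum(w))^2 and compare
# with the table: boxcar 1.00, hann 1.50, hamming 1.36, blackman 1.73,
# flattop 3.77.
N = 4096
enbw = {}
for name in ["boxcar", "hann", "hamming", "blackman", "flattop"]:
    w = get_window(name, N)
    enbw[name] = N * np.sum(w ** 2) / np.sum(w) ** 2
    print(f"{name:9s} ENBW = {enbw[name]:.2f} bins")
```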
How to Use: Window Selection Decision Tree
What is your requirement?
- Need to resolve two close frequencies? → Rectangular or Kaiser (small $\beta$, e.g. 2~4)
- Need to see a weak signal next to a strong one (high dynamic range)? → Blackman or Kaiser (large $\beta$, e.g. 10~14) or Blackman-Harris
- Need precise amplitude measurement of frequency components? → Flat-top
- Unsure / general purpose? → Hann (almost always a safe choice)
- Want continuously adjustable trade-off? → Kaiser (adjust $\beta$ from 0 to 20 for continuous coverage of all trade-offs)
Kaiser $\beta$ rules of thumb: $\beta = 0$ → rectangular window; $\beta \approx 5$ → approximately Hamming; $\beta \approx 6$ → approximately Hann; $\beta \approx 8.5$ → approximately Blackman; $\beta > 10$ → exceeds Blackman's dynamic range.
Python Example: Comparing Five Window Functions
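A minimal sketch of such a comparison (assuming numpy/scipy; the mainlobe/sidelobe measurement logic is ad hoc here, not a library routine): zero-pad each window to sample its DTFT finely, walk down to the first spectral null, and read off the null-to-null mainlobe width and the highest sidelobe.

```python
import numpy as np
from scipy.signal import get_window

# Sketch: measure null-to-null mainlobe width (in DFT bins) and highest
# sidelobe level for the windows in the comparison table above.
N, NFFT = 256, 1 << 18

def window_metrics(w):
    W = np.abs(np.fft.rfft(w, NFFT))
    W /= W.max()
    k = 1
    while W[k] <= W[k - 1]:
        k += 1                                # walk down to the first null
    mainlobe_bins = 2 * k * N / NFFT          # null-to-null width in bins
    sidelobe_db = 20 * np.log10(W[k:].max())  # highest peak past the null
    return mainlobe_bins, sidelobe_db

metrics = {}
for name in ["boxcar", "hann", "hamming", "blackman", ("kaiser", 6.0)]:
    mb, sl = window_metrics(get_window(name, N))
    metrics[str(name)] = (mb, sl)
    print(f"{str(name):16s} mainlobe ≈ {mb:4.1f} bins, sidelobe ≈ {sl:6.1f} dB")
```

The printed numbers should line up with the table: rectangular (boxcar) at 2 bins / -13 dB, Hann at 4 bins / -31 dB, Blackman at 6 bins / below -55 dB.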
Application Scenarios
- THD (Total Harmonic Distortion) analysis: Measuring harmonic distortion of an audio amplifier with a 1 kHz fundamental, needing to see 2nd and 3rd harmonics 90 dB below the fundamental. Use Hann or Flat-top window (Flat-top amplitude error < 0.01 dB; prefer Flat-top if frequency resolution is sufficient). Industry standard: AES17 specifies a Flat-top window for THD+N measurements.
- Audio spectrum analysis (DAW / mixing console): Real-time display of a music signal's spectrum. Typically uses the Hann window, NFFT = 4096~8192 (at 44.1 kHz sampling rate, frequency resolution is about 5~10 Hz). The Hann window's sidelobes are sufficiently low (-31 dB), causing no serious artifacts in the display.
- Radar sidelobe suppression: In radar echoes, sidelobes of strong targets (e.g., large ships) can mask nearby weak targets (e.g., small speedboats). Use the Blackman-Harris 4-term window (sidelobes -92 dB) or the Dolph-Chebyshev window (equiripple design). The cost is a wider mainlobe (range resolution drops by about 50%), but this is worthwhile in scenarios requiring very high dynamic range.
Pitfalls and Limitations
- Forgetting to window = rectangular window = -13 dB sidelobes: This is the most common mistake. Many beginners call `np.fft.fft(x)` directly without windowing, then wonder why the spectrum looks so "dirty." Unless the signal is captured over exactly an integer number of periods, there will always be leakage.
- Flat-top window: accurate amplitude but poor frequency resolution: The mainlobe width is about 10 bins, 2.5 times wider than Hann. Two close frequencies will blur together. Only suitable for scenarios where frequency components are known and well separated (e.g., calibration measurements).
- Window functions reduce the effective data length: Data near the edges is down-weighted by the window. The effective number of samples is $\approx N/\text{ENBW}$, i.e., an effective duration of $\approx N/(\text{ENBW}\cdot f_s)$ seconds. Overlap processing can partially compensate for this loss.
- Coherent gain correction: After windowing, you must divide by the window's mean value (coherent gain = $\frac{1}{N}\sum w[n]$) to correctly read the amplitude of a single frequency component. Forgetting this correction causes amplitude readings to be too low.
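The coherent-gain pitfall is easy to see on a toy sine (a sketch, assuming numpy; the tone is placed exactly on a bin center so scalloping does not interfere):

```python
import numpy as np

# Sketch: read a sine amplitude off a Hann-windowed FFT. Without dividing
# by the coherent gain (the mean of w, ≈0.5 for Hann) the reading is ~2x low.
fs, N, A = 1000.0, 1024, 2.0
f0 = 128 * fs / N                      # 125 Hz, exactly on a bin center
n = np.arange(N)
x = A * np.sin(2 * np.pi * f0 * n / fs)
w = np.hanning(N)
X = np.abs(np.fft.rfft(x * w))
raw = 2 * X.max() / N                  # naive amplitude readout
corrected = raw / w.mean()             # coherent-gain correction
print(f"raw = {raw:.3f}, corrected = {corrected:.3f}, true = {A}")
```

The raw readout comes in at roughly half the true amplitude; dividing by the window mean restores it.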
When Not to Use the Periodogram?
- Need a stable PSD estimate: The periodogram's variance does not decrease with more data → use Welch's method (Section 3.2) or Multitaper
- Data is extremely short (tens of points) and high resolution is needed: FFT resolution = $f_s/N$ is insufficient → use AR parametric model (Section 3.3)
- Need to resolve extremely close sinusoids (super-resolution): Even windowing cannot break the FFT resolution limit → use MUSIC / ESPRIT (Section 3.4)
- Only care about details in a narrow frequency band: A full-range FFT wastes computational resources → use Chirp-Z Transform (Section 3.5)
Interactive: Window Spectrum Comparison
Select a window function and observe its time-domain shape and frequency response (dB scale). Note the trade-off between mainlobe width and sidelobe height.
Interactive: Leakage and Resolution
Two sinusoids with close frequencies — switch windows to compare FFT results in real time. When the two frequencies are too close, some windows' mainlobes are too wide and merge the two peaks into one.
References: [1] Schuster, On the Investigation of Hidden Periodicities with Application to a Supposed 26 Day Period of Meteorological Phenomena, Terr. Magn., 1898. [2] Harris, On the Use of Windows for Harmonic Analysis with the DFT, Proc. IEEE, 1978. [3] Kaiser & Schafer, On the Use of the I₀-Sinh Window for Spectrum Analysis, IEEE Trans. ASSP, 1980. [4] Oppenheim & Schafer, Discrete-Time Signal Processing, Ch.10.
📝 Worked Example
You need to analyze two sinusoids: 100 Hz (0 dB) and 108 Hz (-35 dB). Sampling rate 1000 Hz, observation duration 0.256 s (256 points). Can you resolve them without windowing? With a Hann window? With a Blackman window?
Show solution
Δf = 1000/256 = 3.91 Hz. Frequency spacing 8 Hz > Δf, so theoretically resolvable.
But the weak signal is only -35 dB:
(1) Rectangular window sidelobes -13 dB → leakage from 100 Hz at 108 Hz is about -13 dB, much stronger than the -35 dB true signal → cannot see it
(2) Hann window sidelobes -31 dB → leakage -31 dB, still stronger than -35 dB → barely visible
(3) Blackman window sidelobes -58 dB → the leakage floor is well below -35 dB. One caveat: at $N = 256$ the Blackman mainlobe is 6 bins ≈ 23 Hz wide, more than the 8 Hz spacing, so the weak tone still sits on the strong tone's mainlobe skirt. To see it cleanly, combine the low sidelobes with a longer record (e.g., $N = 1024$, where 8 Hz spans ~8 bins)
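The example can be checked numerically (a sketch; `leak_at` is an ad hoc helper defined here, not a library function). Generating only the strong 100 Hz tone means whatever energy appears at 108 Hz is pure leakage, which can then be compared against the -35 dB level of the weak tone:

```python
import numpy as np

# Sketch: leakage from the strong 100 Hz tone at the 108 Hz bin, per window,
# for the original record length (N=256) and a longer one (N=1024).
def leak_at(f_probe, N, make_window, fs=1000.0):
    n = np.arange(N)
    x = np.sin(2 * np.pi * 100.0 * n / fs) * make_window(N)
    X = np.abs(np.fft.rfft(x))
    return 20 * np.log10(X[round(f_probe * N / fs)] / X.max())

for N in (256, 1024):
    vals = {nm: leak_at(108.0, N, fn)
            for nm, fn in [("rect", np.ones), ("hann", np.hanning),
                           ("blackman", np.blackman)]}
    print(f"N={N:4d}: " + ", ".join(f"{k} {v:6.1f} dB" for k, v in vals.items()))
print("the -35 dB tone is visible only where leakage sits well below -35 dB")
```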
✅ Quick Check
Q1: You need to detect a weak signal next to a strong one (40 dB difference). Which window should you use?
Show answer
Blackman window (sidelobes -58 dB) or Kaiser (with large β). The Hann window's sidelobes are only -31 dB, which is not enough.
Q2: What does computing an FFT without windowing amount to?
Show answer
It amounts to using a rectangular window, with sidelobes only -13 dB — almost guaranteed to cause severe spectral leakage.
3.2 Welch's Method & Multitaper Spectral Estimation
Two major approaches to reducing PSD estimation variance
Why does this matter? Because the spectrum computed from a single FFT differs every time (high variance), and cannot be directly used to set alarm thresholds or make statistical comparisons. Welch's method makes the results stable enough for engineering decisions.
Previously: Section 3.1's windows solved the leakage problem, but the periodogram has another fatal flaw: the variance is too large (results differ every time). How do we stabilize the spectral estimate?
Learning Objectives
- Understand Welch's method: bias-variance trade-off of segment-window-average
- Master the parameter selection process for segment length, overlap ratio, and window
- Learn about DPSS (Slepian) sequences and the Multitaper method
- Compare applicable scenarios for Welch and Multitaper
One-Sentence Summary
The core idea of Welch's method is extremely simple: divide a long data record into multiple segments, compute the spectrum of each segment, then average — stabilizing the result.
Pain Point: Unstable Spectrum
You compute the PSD of a vibration signal using the periodogram, and the result looks different every time — wild fluctuations that look more like noise than a clean spectrum. You cannot use such unstable results to set machine monitoring alarm thresholds, nor can you reliably compare two measurements.
This is because the periodogram is an inconsistent estimator — no matter how long the data you collect, the relative fluctuation of the estimate remains around 100%. You need a method to "smooth out" these fluctuations.
Origin
M.S. Bartlett (1948) first proposed the idea of segment averaging: divide the data into $K$ non-overlapping segments, compute the periodogram of each, then average. This simply reduces the variance to $\approx 1/K$.
Peter Welch (1967) at IBM made two key improvements to Bartlett's method: (1) allowing segments to overlap, squeezing more segments from the same data length; (2) applying a window function to each segment (instead of a rectangular window), reducing leakage. Because of its simplicity and effectiveness, this method became the de facto standard for computing PSD in engineering. Python's scipy.signal.welch() and MATLAB's pwelch() are both based on it.
David Thomson (1982) at Bell Labs proposed a completely different approach — the Multitaper method: instead of segmenting, use multiple orthogonal windows (DPSS / Slepian sequences) to perform multiple windowed FFTs on the same data, then average. This avoids sacrificing frequency resolution (because no segmenting), and is the gold standard for PSD estimation of short data.
Principle: Welch's Method
Intuition: A single photo may have noise, but averaging $K$ photos produces a clean image. Welch's method does the same for spectra — divide the data into segments, compute a "spectral photo" for each, then average.
Welch PSD Estimate
$$\hat{S}_W(f) = \frac{1}{K}\sum_{i=0}^{K-1}\frac{1}{LU}\left|\sum_{n=0}^{L-1}w[n]\,x[n+iD]\,e^{-j2\pi fn/f_s}\right|^2$$$L$: segment length, $D$: hop size ($D = L - \text{overlap}$), $K$: number of segments, $U = \frac{1}{L}\sum|w[n]|^2$ (window power normalization)
Core Trade-off (Bias-Variance Tradeoff):
- More segments $K$ → lower variance ($\approx 1/K$) → more stable spectrum
- But for fixed total data length $N$, more $K$ → shorter segments ($L$ smaller) → worse frequency resolution $\Delta f = f_s/L$
- This is unavoidable: frequency resolution × stability = constant (determined by total data length $N$)
Expand: Derivation of equivalent degrees of freedom for Welch's method
Each segment's periodogram approximately follows a $\chi^2_2$ distribution (2 degrees of freedom). After averaging $K$ segments:
$$\hat{S}_W(f) \sim \frac{S(f)}{K_{\text{eff}}}\,\chi^2_{2K_{\text{eff}}}$$where $K_{\text{eff}}$ is the effective number of independent segments. If segments do not overlap (Bartlett), $K_{\text{eff}} = K$. With 50% overlap + Hann window:
$$K_{\text{eff}} \approx \frac{K}{1 + 2\sum_{k=1}^{K-1}(1-k/K)\rho_k^2}$$where $\rho_k$ is the correlation coefficient between the $k$-th pair of adjacent windowed segments. For Hann window with 50% overlap, $\rho_1 \approx 0.167$; more distant segments are nearly uncorrelated.
Empirical conclusion: Hann + 50% overlap yields about 1.6 times the equivalent degrees of freedom of the non-overlapping version — squeezing 60% more statistical independence from the same data.
Normalized standard deviation (relative error): $\epsilon = 1/\sqrt{K_{\text{eff}}}$. Engineering rule of thumb: $\epsilon < 0.1$ (i.e., $K_{\text{eff}} > 100$) to be considered "stable." $\;\blacksquare$
How to Use: Four-Step Parameter Selection
Step 1: Determine frequency resolution $\Delta f$
Segment length $L = f_s / \Delta f$. Example: $f_s = 10\,\text{kHz}$, need $\Delta f = 1\,\text{Hz}$ → $L = 10000$ points.
Step 2: Choose overlap ratio
Hann window: 50% overlap ($D = L/2$) is the standard choice. Windows with wider mainlobes and heavier edge tapering (e.g., Blackman-Harris, flat-top) benefit from more overlap (typically 67~75%), because each segment down-weights more of its edge data.
Step 3: Choose window
Usually Hann. For higher dynamic range, use Blackman.
Step 4: Compute segment count and stability
$K = \lfloor (N - L)/D \rfloor + 1$, where $D$ is the hop size from Step 2. Equivalent degrees of freedom $\nu \approx 2K \times (\text{overlap correction})$; normalized error $\epsilon \approx 1/\sqrt{K_{\text{eff}}}$ with $K_{\text{eff}} = \nu/2$.
Concrete example: Vibration monitoring
- Accelerometer sampling rate $f_s = 10\,\text{kHz}$
- Want frequency resolution $\Delta f = 1\,\text{Hz}$ → $L = 10000$ points (1 second)
- Hann window, 50% overlap → $D = 5000$
- Collect 10 seconds of data ($N = 100000$)
- Number of segments $K = (100000 - 10000)/5000 + 1 = 19$ segments
- Equivalent DOF $\nu \approx 2 \times 19 \times 0.85 \approx 32$, i.e., $K_{\text{eff}} \approx 16$ (normalized error $\epsilon \approx 1/\sqrt{16} = 25\%$)
- If more stability is needed, collect more data: 30 seconds gives $K = 59$ ($K_{\text{eff}} \approx 50$, $\epsilon \approx 14\%$); reaching $\epsilon < 10\%$ ($K_{\text{eff}} > 100$) takes roughly 60 seconds
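The recipe above can be run end-to-end with `scipy.signal.welch` (a sketch; the two tone frequencies and noise level are a made-up stand-in for real accelerometer data):

```python
import numpy as np
from scipy.signal import welch

# Sketch of the vibration-monitoring recipe: fs = 10 kHz, L = 10000 (Δf = 1 Hz),
# Hann window, 50% overlap, 10 s of data -> K = 19 segments.
fs = 10_000
t = np.arange(10 * fs) / fs                       # 10 s of data, N = 100000
rng = np.random.default_rng(1)
x = (np.sin(2 * np.pi * 997 * t)
     + 0.1 * np.sin(2 * np.pi * 1503 * t)
     + rng.standard_normal(t.size))
f, Pxx = welch(x, fs=fs, window="hann", nperseg=10_000, noverlap=5_000)
K = (x.size - 10_000) // 5_000 + 1
print(f"Δf = {f[1] - f[0]:.2f} Hz, K = {K} segments")
print(f"strongest peak at {f[np.argmax(Pxx)]:.0f} Hz")
```

With 19 averaged segments, both tones stand cleanly above a smooth noise floor — re-running with fresh noise changes the floor only slightly, unlike a single periodogram.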
Multitaper Method (Thomson, 1982)
Intuition: Welch's method shortens the data to gain more segments for averaging, sacrificing resolution. Can we use all the data to preserve resolution while still averaging multiple estimates? Yes — use different windows on the same data for FFT. But these windows must be orthogonal; otherwise the results are not independent and averaging has no effect.
DPSS (Discrete Prolate Spheroidal Sequences / Slepian sequences) are exactly such a window family: among all length-$N$ sequences, they maximize the energy concentrated in a band of half-width $W = NW/N$ (normalized frequency). They are mutually orthogonal by construction, and the first $K \approx 2NW$ of them have concentration eigenvalues close to 1 — almost all their energy lies inside the target band.
$$\hat{S}_{MT}(f) = \frac{1}{K}\sum_{k=0}^{K-1}\left|\sum_{n=0}^{N-1}v^{(k)}[n]\,x[n]\,e^{-j2\pi fn/f_s}\right|^2$$$\{v^{(k)}\}_{k=0}^{K-1}$: first $K$ DPSS (unit energy), $NW$: half-bandwidth parameter (commonly $NW = 3$ or 4)
| Property | Welch | Multitaper |
|---|---|---|
| Variance reduction method | Segment averaging | Multi-window averaging (no segmenting) |
| Frequency resolution | $f_s/L$ ($L$ = segment length < $N$) | $2NW \cdot f_s/N$ ($\approx f_s/N$ level) |
| Equivalent DOF | $\approx 2K$ | $\approx 2K$ ($K \approx 2NW$) |
| Best scenario | Long data | Short data (preserves full resolution) |
| Computational cost | $K$ FFTs ($L$-point) | $K$ FFTs ($N$-point) + DPSS computation |
| Implementation | scipy.signal.welch() | spectrum.pmtm() / nitime |
Gold Standard: Multitaper requires no segmenting (preserving full frequency resolution) yet still reduces variance. For PSD estimation of short data (hundreds to thousands of points), it is the recognized optimal method. The cost: DPSS must be precomputed (but only once), and $K$ is limited to $\approx 2NW$ (typically 5~8), unlike Welch which can have tens or even hundreds of segments.
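A minimal multitaper estimator is only a few lines (a sketch, assuming `scipy.signal.windows.dpss` is available; the $1/(Kf_s)$ normalization is one common convention, correct up to one-sided scaling constants):

```python
import numpy as np
from scipy.signal import windows

# Sketch: K orthogonal Slepian tapers applied to the SAME record,
# then the K single-taper spectra are averaged.
def multitaper_psd(x, fs, NW=4.0):
    N = len(x)
    K = int(2 * NW) - 1                        # common choice: K = 2NW - 1
    tapers = windows.dpss(N, NW, Kmax=K)       # shape (K, N), unit-energy rows
    S = np.zeros(N // 2 + 1)
    for v in tapers:
        S += np.abs(np.fft.rfft(v * x)) ** 2
    return np.fft.rfftfreq(N, 1 / fs), S / (K * fs)

fs, N = 1000.0, 1000
t = np.arange(N) / fs
rng = np.random.default_rng(2)
x = np.sin(2 * np.pi * 137 * t) + rng.standard_normal(N)
f, S = multitaper_psd(x, fs)
print(f"peak at {f[np.argmax(S)]:.0f} Hz")
```

Note the trade-off discussed above in action: the tone's peak is smeared over the $2NW \cdot f_s/N = 8$ Hz analysis bandwidth, but the noise floor is far smoother than a single periodogram's — with no segmenting of the 1-second record.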
Application Scenarios
- Vibration PSD trend monitoring (Welch): A wind turbine gearbox takes a 60-second acceleration data segment every hour ($f_s = 25.6\,\text{kHz}$), and uses Welch's method ($L = 25600$, 50% overlap, Hann) to compute a stable PSD. Daily comparison of PSD energy in specific frequency bands reveals upward trends → early fault warning.
- Communication system noise floor measurement (Welch): Measuring an RF receiver's noise power spectral density $N_0$. Requires a very stable PSD estimate ($\epsilon < 5\%$), typically using long acquisitions + Welch's method with $K > 400$ segments.
- Neuroscience LFP/EEG power spectra (Multitaper): EEG signal trials are typically only 1-2 seconds long. Multitaper ($NW = 4$, $K = 7$ tapers) provides a stable PSD estimate while preserving $\sim 2\,\text{Hz}$ resolution. This is the standard practice in the neuroscience community.
Pitfalls and Limitations
- Segments too short → frequency blurring: If $\Delta f = f_s/L = 100\,\text{Hz}$, but the two frequencies you want to see differ by only 50 Hz, they will blur together. Always verify that $L$ corresponds to a $\Delta f$ that meets your requirements.
- Too few segments → still unstable: $K = 3$ segments give only 6 degrees of freedom; the PSD estimate remains very noisy. Practical minimum: $K \geq 8$ (16 DOF) before it becomes useful.
- Too much overlap → segments not independent: 90% overlap appears to give many segments, but adjacent segments share nearly identical data, greatly reducing the averaging benefit. Hann + 50% is the optimal balance point.
- Multitaper $NW$ selection: $NW$ too large → poor resolution (equivalent bandwidth $= 2NW \cdot f_s/N$); $NW$ too small → poor DPSS quality (high-order taper energy leakage). Typically $NW = 3$ or 4.
When Not to Use?
- Signal is non-stationary: Welch assumes stationary statistics within each segment. If the signal's frequency changes over time (e.g., chirp), Welch averages spectra from different time instants together → use STFT / spectrogram (Section 5.1) instead
- Only need to detect discrete sinusoidal frequencies (no continuous PSD needed): → use MUSIC / ESPRIT (Section 3.4) for greater accuracy
- Data is extremely short and the signal model is known (e.g., speech): → use AR parametric model (Section 3.3) instead
References: [1] Bartlett, Smoothing Periodograms from Time-Series with Continuous Spectra, Nature, 1948. [2] Welch, The Use of FFT for the Estimation of Power Spectra, IEEE Trans. Audio Electroacoustics, 1967. [3] Thomson, Spectrum Estimation and Harmonic Analysis, Proc. IEEE, 1982. [4] Percival & Walden, Spectral Analysis for Physical Applications, Cambridge, 1993.
📝 Worked Example
Vibration monitoring: fs=10 kHz, want Δf=2 Hz PSD. Using Hann window with 50% overlap. (a) Segment length? (b) How many segments in 10 seconds of data? (c) Equivalent degrees of freedom?
Show solution
(a) L = fs/Δf = 10000/2 = 5000 points
(b) hop = 5000×0.5 = 2500, K = floor((100000−5000)/2500)+1 = 39 segments
(c) With 50% overlap, adjacent segments share half their data and are not fully independent. Applying the same ≈0.85 correction used in the vibration example: $K_{\text{eff}} \approx 0.85 \times 39 \approx 33$, equivalent DOF $\nu \approx 2 \times 33 \approx 66$
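skipped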
✅ Quick Check
Q1: Welch's method with segment length L=1000, sampling rate fs=10 kHz — what is the frequency resolution?
Show answer
Δf = fs/L = 10000/1000 = 10 Hz.
Q2: If the number of segments doubles (K→2K), by roughly what factor does the PSD estimation variance change?
Show answer
Approximately halved (~1/K), provided the segments are approximately independent.
Interactive: Welch's Method vs Single FFT
Same noisy signal (containing two sinusoids), comparing the periodogram's "jagged" look vs Welch's "smooth" result. Each click regenerates the noise.
3.3 Parametric Spectral Estimation (AR Model)
Model fitting instead of direct FFT — high-resolution spectra from short data
Why does this matter? Because when data is short (only a few dozen samples), the FFT's frequency resolution is too poor. The AR model can squeeze higher resolution from short data than the FFT — this is especially important in speech analysis (20 ms per frame) and biomedical signals.
Previously: Section 3.2's Welch method reduces variance by segment averaging, but sacrifices frequency resolution. If the data is short, is there a way to avoid sacrificing resolution?
Learning Objectives
- Establish the equivalence between the AR(p) model and linear prediction
- Derive the Yule-Walker equations and Levinson-Durbin recursion
- Understand the lattice structure of the Burg algorithm
- Master AIC/BIC criteria for AR order selection
- Compare the resolution of AR spectra versus the periodogram
One-Sentence Summary
Rather than computing the FFT directly, assume the signal is generated by a model (AR model), and use the model to infer the spectrum — even short data can yield a smooth, high-resolution result.
Pain Point: The Resolution Bottleneck of Short Data
Your data has only a few dozen samples (e.g., a 20 ms speech frame at 8 kHz sampling rate has only 160 points), giving an FFT frequency resolution of $\Delta f = f_s/N = 8000/160 = 50\,\text{Hz}$. If two speech formants are at 500 Hz and 530 Hz, only 30 Hz apart — far less than the 50 Hz resolution — the FFT will show a single blurry wide peak, completely unable to separate them.
You cannot collect longer data (speech is non-stationary; beyond 20-30 ms the statistical properties change), and zero-padding only interpolates without truly increasing resolution. You need a method that can "squeeze" more spectral information from short data.
Origin
The story of the AR model traces back to G. Udny Yule (1927) and Gilbert Walker (1931), who developed the estimation theory for autoregressive models (Yule-Walker equations) to analyze the quasi-periodicity of sunspots and the periodic patterns of Indian monsoons.
Norman Levinson (1947) and later James Durbin (1960) developed an efficient recursive solution (Levinson-Durbin recursion), reducing computational complexity from $O(p^3)$ (direct solution of linear equations) to $O(p^2)$.
The person who truly brought the AR model into spectral estimation was John Parker Burg (1967), who proposed the Maximum Entropy Method (MEM) in his Ph.D. dissertation at Stanford. Burg's insight was: given limited autocorrelation data, among all consistent PSDs, the one with "maximum information entropy" is exactly the PSD corresponding to the AR model. This gave AR spectral estimation an information-theoretic foundation. Burg's advisor was Robert White, whose research was motivated by short-data spectral analysis in geophysics — seismic exploration data is often short, and FFT resolution is insufficient.
In speech processing, the AR model is widely known by the name LPC (Linear Predictive Coding). Itakura (1968) and Atal & Hanauer (1971) applied the AR model to speech analysis and coding, ushering in the era of digital speech communication. The core of early GSM mobile voice coding (RPE-LTP) and CELP coding is the AR model.
Principle: AR(p) Model
Intuition: The AR model assumes that each sample of the signal can be predicted by a linear combination of the past $p$ samples plus white noise. If the prediction is good, the residual is white noise. Viewed in reverse, the linear predictor is an "all-pole filter," with white noise passing through it to produce the observed signal. Each pair of conjugate poles produces a peak in the spectrum.
AR(p) Difference Equation
$$x[n] = -\sum_{k=1}^{p}a_k\,x[n-k] + e[n], \quad e[n] \sim \text{WN}(0, \sigma^2)$$Equivalent representation: $A(z)\,X(z) = E(z)$, where $A(z) = 1 + a_1 z^{-1} + \cdots + a_p z^{-p}$
PSD of the AR model:
$$S_{AR}(f) = \frac{\sigma^2}{\left|A(e^{j2\pi f/f_s})\right|^2}$$Since the denominator is the squared magnitude of a polynomial, $S_{AR}(f)$ has only peaks (corresponding to poles near the unit circle), and no zeros (valleys). Each pair of conjugate poles $r\,e^{\pm j\theta}$ produces a peak at $f = \theta f_s/(2\pi)$; the closer the pole radius $r$ is to 1, the sharper the peak.
Yule-Walker Equations
Expand derivation
Multiply both sides of the AR equation by $x^*[n-m]$ and take the expectation:
$$E[x[n]\,x^*[n-m]] = -\sum_{k=1}^{p}a_k\,E[x[n-k]\,x^*[n-m]] + E[e[n]\,x^*[n-m]]$$Left side = $r_{xx}[m]$ (autocorrelation function). First term on the right = $-\sum a_k\,r_{xx}[m-k]$.
Second term on the right: because $e[n]$ is only correlated with $x[n], x[n-1], \ldots$ (not with future values):
- $m = 0$: $E[e[n]\,x^*[n]] = \sigma^2$ ($e[n]$ is part of $x[n]$)
- $m \geq 1$: $E[e[n]\,x^*[n-m]] = 0$ ($e[n]$ is uncorrelated with past $x$)
For $m = 1, 2, \ldots, p$, we obtain the Yule-Walker linear system:
$$\underbrace{\begin{bmatrix}r[0]&r[-1]&\cdots&r[1-p]\\r[1]&r[0]&\cdots&r[2-p]\\\vdots&&\ddots&\vdots\\r[p-1]&\cdots&&r[0]\end{bmatrix}}_{\mathbf{R}\;(\text{Toeplitz})}\begin{bmatrix}a_1\\a_2\\\vdots\\a_p\end{bmatrix} = -\begin{bmatrix}r[1]\\r[2]\\\vdots\\r[p]\end{bmatrix}$$$m = 0$: $\sigma^2 = r[0] + \sum_{k=1}^{p}a_k\,r[-k]$ (white noise power).
$\mathbf{R}$ is a Toeplitz positive-definite matrix (since it is an autocorrelation matrix), solvable by the Levinson-Durbin recursion in $O(p^2)$ (vs. $O(p^3)$ for general linear systems). $\;\blacksquare$
Levinson-Durbin Recursion
Expand algorithm steps
Progressively build from AR(1) up to AR(p):
Initialization ($m=0$): $\sigma_0^2 = r[0]$
Recursion ($m = 1, 2, \ldots, p$):
$$k_m = -\frac{r[m] + \sum_{i=1}^{m-1}a_i^{(m-1)}\,r[m-i]}{\sigma_{m-1}^2} \quad \text{(reflection coefficient)}$$ $$a_m^{(m)} = k_m$$ $$a_i^{(m)} = a_i^{(m-1)} + k_m\,a_{m-i}^{(m-1)}, \quad i = 1, \ldots, m-1$$ $$\sigma_m^2 = (1 - |k_m|^2)\,\sigma_{m-1}^2$$Stability guarantee: If $|k_m| < 1$ holds for all $m$ (guaranteed by the autocorrelation method), then the AR model is stable (all poles inside the unit circle).
At each step, $\sigma_m^2$ is the prediction error power of AR($m$) — decreasing as $m$ increases. When adding one more order causes $\sigma_m^2$ to barely decrease, that is an indicator of the optimal order. $\;\blacksquare$
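The recursion above fits in a dozen lines and can be cross-checked against a direct solve of the Yule-Walker system (a sketch, assuming numpy):

```python
import numpy as np

# Sketch of the Levinson-Durbin recursion, verified two ways:
# against a direct Toeplitz solve, and against known AR(2) coefficients.
def levinson_durbin(r, p):
    """r: autocorrelation lags r[0..p] -> (a[1..p], sigma2, reflection coeffs)."""
    a = np.zeros(p + 1)
    a[0] = 1.0
    sigma2 = r[0]
    ks = []
    for m in range(1, p + 1):
        k = -(r[m] + np.dot(a[1:m], r[m - 1:0:-1])) / sigma2   # reflection coeff
        a_prev = a[1:m].copy()
        a[1:m] = a_prev + k * a_prev[::-1]                     # order update
        a[m] = k
        sigma2 *= 1.0 - k * k                                  # error power update
        ks.append(k)
    return a[1:p + 1], sigma2, ks

# synthesize an AR(2) process: x[n] = 1.5 x[n-1] - 0.9 x[n-2] + e[n]
rng = np.random.default_rng(3)
e = rng.standard_normal(50_000)
x = np.zeros_like(e)
for n in range(2, len(x)):
    x[n] = 1.5 * x[n - 1] - 0.9 * x[n - 2] + e[n]

p = 2
N = len(x)
r = np.array([np.dot(x[:N - k], x[k:]) / N for k in range(p + 1)])  # biased
a_hat, s2, ks = levinson_durbin(r, p)

R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
a_direct = np.linalg.solve(R, -r[1:])
print("Levinson:", np.round(a_hat, 3), " direct:", np.round(a_direct, 3))
```

Both routes land on $a \approx [-1.5,\ 0.9]$ (the generating coefficients, in the sign convention $x[n] = -\sum a_k x[n-k] + e[n]$), and all reflection coefficients satisfy $|k_m| < 1$, confirming a stable model.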
Order Selection: Information-Theoretic Foundation of AIC and BIC
Expand derivation: Information-theoretic foundation of AIC and BIC
AIC and BIC are not arbitrary formulas — they come from the statistical principle of "balancing goodness of fit against model complexity."
AIC (Akaike Information Criterion, 1974):
$$\text{AIC}(p) = -2\ln L(\hat{\theta}_p) + 2p$$where $L$ is the likelihood and $p$ is the number of parameters.
Origin: Akaike proved that this formula is an unbiased estimator of the KL divergence (Kullback-Leibler divergence):
$$\text{AIC} \approx 2N \cdot D_{KL}(\text{true distribution} \| \text{model})$$The first term $-2\ln L$ measures "how well the model fits the data" (smaller is better).
The second term $2p$ is the "complexity penalty" — each additional parameter costs 2 points, preventing overfitting.
For an AR(p) model with Gaussian residuals, it simplifies to $\text{AIC}(p) = N\ln\hat{\sigma}_p^2 + 2p$.
BIC (Bayesian Information Criterion, 1978):
$$\text{BIC}(p) = -2\ln L(\hat{\theta}_p) + p\ln N$$The second term is $p\ln N$ rather than $2p$.
Origin: BIC comes from Bayes' theorem — it is an approximation of the negative log of the posterior probability $P(\text{model}|\text{data})$.
As $N \to \infty$, $\ln N$ grows larger than $2$, so BIC imposes a stronger complexity penalty and tends to choose simpler models.
Which one to choose?
| Criterion | Characteristics | Suitable for |
|---|---|---|
| AIC | Tends to pick more complex models | Prediction-focused tasks (capture detail) |
| BIC | Tends to pick simpler models; asymptotically consistent | Explanation-focused tasks (find the true model order) |
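An order-selection scan is straightforward to sketch (assuming numpy/scipy; the pole locations of the synthetic AR(4) process are made up for illustration):

```python
import numpy as np
from scipy.signal import lfilter

# Sketch: scan AR orders and minimize AIC(p) = N ln σ²_p + 2p and
# BIC(p) = N ln σ²_p + p ln N on a synthetic AR(4) process.
rng = np.random.default_rng(4)
N = 2000
pair1 = [1, -2 * 0.95 * np.cos(0.3 * np.pi), 0.95 ** 2]   # conjugate pole pair 1
pair2 = [1, -2 * 0.90 * np.cos(0.6 * np.pi), 0.90 ** 2]   # conjugate pole pair 2
A_true = np.convolve(pair1, pair2)                        # stable AR(4) polynomial
x = lfilter([1.0], A_true, rng.standard_normal(N))        # white noise -> all-pole

def ar_sigma2(x, p):
    """Yule-Walker fit of order p; returns the prediction-error power."""
    n = len(x)
    r = np.array([np.dot(x[:n - k], x[k:]) / n for k in range(p + 1)])
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    a = np.linalg.solve(R, -r[1:])
    return r[0] + np.dot(a, r[1:])

orders = list(range(1, 13))
aic = [N * np.log(ar_sigma2(x, p)) + 2 * p for p in orders]
bic = [N * np.log(ar_sigma2(x, p)) + p * np.log(N) for p in orders]
print("AIC picks p =", orders[int(np.argmin(aic))])
print("BIC picks p =", orders[int(np.argmin(bic))])
```

As the table suggests, BIC's heavier $p\ln N$ penalty locks onto the true order, while AIC may drift one or two orders higher.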
How to Use: Four-Step Process
Step 1: Choose AR order $p$
Rule of thumb: $p \approx 2 \times$ (expected number of spectral peaks). For example, speech has 4~5 formants → $p \approx 10$, plus glottal pulse and radiation effects → typically $p = 10$~$14$ (at 8 kHz sampling rate).
Information criteria: AIC = $N\ln\sigma_p^2 + 2p$; BIC = $N\ln\sigma_p^2 + p\ln N$. Choose $p$ that minimizes AIC/BIC. BIC penalizes high orders more heavily, favoring lower $p$.
Step 2: Compute autocorrelation $r[0], r[1], \ldots, r[p]$
$r[k] = \frac{1}{N}\sum_{n=0}^{N-1-k}x[n+k]\,x^*[n]$ (biased estimator, but guarantees the Toeplitz matrix is positive-definite).
Step 3: Solve Yule-Walker (Levinson-Durbin) for $\{a_k\}$ and $\sigma^2$
Or use the Burg algorithm (no need to compute autocorrelation first; estimates reflection coefficients directly from data with better statistical properties).
Step 4: Compute PSD
Evaluate $S_{AR}(f) = \sigma^2/|A(e^{j2\pi f/f_s})|^2$ on a dense frequency grid.
Concrete example: Speech formant analysis
- Sampling rate $f_s = 8\,\text{kHz}$, speech frame length 20 ms → $N = 160$ points
- FFT resolution = $8000/160 = 50\,\text{Hz}$ (too coarse!)
- Choose AR(12): 4 formants = 4 conjugate pole pairs (8 poles), plus 2 extra pole pairs for the glottal pulse and lip radiation = 12 poles
- Levinson-Durbin solution → yields 12 AR coefficients
- Compute PSD → smooth spectral envelope clearly showing $F_1 \approx 500\,\text{Hz}$, $F_2 \approx 1500\,\text{Hz}$, $F_3 \approx 2500\,\text{Hz}$ formants
- This is the core of LPC (Linear Predictive Coding)!
Application Scenarios
- Speech analysis and coding (LPC): LPC = AR model. The core of GSM mobile voice coding (RPE-LTP, 13 kbps) and CELP coding (e.g., AMR, G.729) is the AR(10)~AR(16) model. Vocoders use the AR model to separate glottal excitation from vocal tract resonance, forming the basis of speech synthesis and voice modification technology.
- High-resolution frequency estimation from short data: Seismic exploration reflection analysis with data windows of only 50~100 samples. The AR model can resolve multi-layer reflection frequency differences where FFT resolution is insufficient. Real example: 100 points @ 1 kHz data, FFT $\Delta f = 10\,\text{Hz}$, AR(20) successfully resolves two peaks separated by 5 Hz.
- Heart rate variability (HRV) frequency-domain analysis: ECG R-R interval series typically have only 300~500 data points (5-minute short-term HRV). AR(16)~AR(20) models can clearly resolve LF (0.04-0.15 Hz) and HF (0.15-0.4 Hz) components, much smoother and more stable than FFT. In clinical diagnosis, the LF/HF ratio is an important indicator of autonomic nervous system function.
Pitfalls and Limitations
- AR has only peaks, no valleys: The all-pole model can inherently only describe spectra with peaks. If the true spectrum has deep valleys (zeros), the AR model requires a very high order to approximate them, and the result is poor. In such cases, consider ARMA models (which have both poles and zeros).
- Order too low → missing peaks: AR(4) can have at most 2 spectral peaks. If there are actually 3 peaks, the third will be completely missed.
- Order too high → spurious peaks: AR(30) on 160 data points will overfit, producing nonexistent false peaks. Always use AIC/BIC as a safeguard.
- Worse than periodogram for broadband noise: The AR model assumes the spectrum is smooth (determined by a few poles). For a broadband flat noise floor, it actually performs worse than the direct periodogram.
- Non-stationary signals: AR assumes stationarity. For non-stationary signals, you need to apply AR in short windows, or use time-varying AR (e.g., Kalman filtering + AR).
When Not to Use?
- Data is long and only a general PSD is needed: Use Welch's method (Section 3.2) directly — simpler, more robust, no order selection needed
- Spectrum has prominent zeros (deep valleys): Consider ARMA models (but estimation is more complex and convergence is worse)
- Need super-resolution to resolve sinusoidal frequencies: AR model resolution is still limited by the signal's SNR → use MUSIC / ESPRIT (Section 3.4) instead
- Signal model is unclear: AR's advantage depends on "correct assumptions." If uncertain whether the signal suits an AR description, non-parametric methods (Welch / Multitaper) are safer
Interactive: AR Spectrum vs FFT
Signal contains two sinusoids (100 Hz + 105 Hz) + noise, with only 64 sample points. Compare the frequency resolution of the periodogram vs. the AR model. Adjust the AR order to observe the effect: too low misses peaks, too high produces spurious peaks.
References: [1] Burg, Maximum Entropy Spectral Analysis, Ph.D. dissertation, Stanford, 1975. [2] Kay & Marple, Spectrum Analysis — A Modern Perspective, Proc. IEEE, 1981. [3] Kay, Modern Spectral Estimation: Theory and Application, Prentice Hall, 1988. [4] Stoica & Moses, Spectral Analysis of Signals, Pearson, 2005. [5] Makhoul, Linear Prediction: A Tutorial Review, Proc. IEEE, 1975.
✅ Quick Check
Q1: What happens if the AR model order is too low? Too high?
Show answer
Too low: misses real spectral peaks. Too high: produces spurious peaks (overfitting). AIC/BIC is typically used for selection.
Q2: Why can the AR model not represent spectral "valleys"?
Show answer
Because AR is an all-pole model. Poles can only produce peaks; without zeros, valleys cannot be created. An ARMA model is needed.
3.4 MUSIC & ESPRIT — Subspace Frequency Estimation
Frequency estimation methods that surpass the Fourier resolution limit
Why does this matter? Because some scenarios (radar target resolution, array antenna localization) require resolving frequencies or angles spaced closer than a single FFT bin. Subspace methods are currently the only mainstream technique capable of super-resolution estimation.
Previously: Section 3.3's AR model has better resolution than FFT, but is still limited by model assumptions. Is there a completely different approach that can surpass the Fourier frequency resolution limit?
Learning Objectives
- Understand the decomposition into signal subspace and noise subspace
- Derive the principle of the MUSIC pseudospectrum
- Learn about ESPRIT's rotational invariance property
- Master signal number estimation (MDL/AIC) and limitation conditions of subspace methods
One-Sentence Summary
MUSIC can resolve two frequencies spaced closer than a single FFT bin — it is not a better FFT, but a completely different approach: separating the signal space from the noise space.
Pain Point: Two Frequencies the FFT Cannot Separate
You have two radar targets too close together, with Doppler frequencies of 100 Hz and 103 Hz. Sampling rate 1 kHz, and you captured only 64 data points. One FFT frequency bin = $f_s/N = 1000/64 = 15.6\,\text{Hz}$, while the two frequencies differ by only 3 Hz — the FFT shows a single wide peak and cannot tell whether it is one target or two.
Zero-pad to 1024 points? That only interpolates the frequency axis to make bins denser, but the sinc mainlobe width does not change — the two targets are still blurred within the same mainlobe.
AR model? It helps, but at low SNR it can easily produce bias or spurious peaks. You need a fundamentally different method.
Origin
Ralph O. Schmidt (1979) at ESL Inc. (a defense electronics company) developed the MUSIC (Multiple Signal Classification) algorithm. His original motivation was Direction of Arrival (DOA) estimation in radar and electronic warfare: multiple electromagnetic waves arrive at an antenna array from different directions — how to estimate each wave's arrival angle with high precision? Traditional beamforming methods are limited by the array aperture. Schmidt's breakthrough was exploiting the subspace structure of the signal autocorrelation matrix, bypassing the traditional resolution limit.
The MUSIC paper (IEEE Trans. AP, 1986) became one of the most cited papers in signal processing (Google Scholar > 12,000 citations). Schmidt received an IEEE Society Award in 2000 for this work.
Richard Roy and Thomas Kailath (1986) at Stanford proposed ESPRIT (Estimation of Signal Parameters via Rotational Invariance Techniques). ESPRIT exploits the rotational invariance between two sub-arrays, without needing to search the pseudospectrum (as MUSIC does), directly obtaining frequency estimates from matrix eigenvalues. It is computationally faster and less sensitive to precise array calibration.
These two methods are collectively known as subspace methods, and are core tools for frequency/direction estimation in modern radar, communications, sonar, and other systems.
Principle: Intuition of Subspaces
Core intuition: Imagine you are in an $M$-dimensional space. The received data = signal + noise. If there are $p$ sinusoidal signals, they "live" in a $p$-dimensional subspace (the signal subspace). Noise is spread across the entire $M$-dimensional space.
Eigendecomposition of the autocorrelation matrix separates these two spaces: $p$ large eigenvalues correspond to eigenvectors spanning the signal subspace; the remaining $M-p$ small eigenvalues ($\approx \sigma^2$, noise power) correspond to eigenvectors spanning the noise subspace.
Key fact: The signal's steering vector is necessarily orthogonal to the noise subspace. So when you take a test vector $\mathbf{a}(\omega)$ and compute its inner product with the noise subspace, at the correct frequency the inner product is zero, and taking the reciprocal gives infinity — a sharp peak.
Mathematical Derivation
Assume we observe $p$ complex sinusoids plus white noise. Construct the $M \times M$ autocorrelation matrix:
$$\mathbf{R} = \mathbf{A}\,\mathbf{S}\,\mathbf{A}^H + \sigma^2\mathbf{I}$$$\mathbf{A} = [\mathbf{a}(\omega_1), \ldots, \mathbf{a}(\omega_p)]$: steering matrix, $\mathbf{S}$: signal covariance matrix
where the steering vector: $\mathbf{a}(\omega) = [1,\, e^{j\omega},\, e^{j2\omega},\, \ldots,\, e^{j(M-1)\omega}]^T$
Expand: MUSIC pseudospectrum derivation
Eigendecompose $\mathbf{R}$:
$$\mathbf{R} = \sum_{i=1}^{M}\lambda_i\,\mathbf{e}_i\mathbf{e}_i^H = \underbrace{\sum_{i=1}^{p}(\lambda_i^s + \sigma^2)\,\mathbf{e}_i\mathbf{e}_i^H}_{\text{signal subspace}} + \underbrace{\sigma^2\sum_{i=p+1}^{M}\mathbf{e}_i\mathbf{e}_i^H}_{\text{noise subspace}}$$where $\lambda_1 \geq \cdots \geq \lambda_p > \lambda_{p+1} = \cdots = \lambda_M = \sigma^2$.
Let $\mathbf{E}_n = [\mathbf{e}_{p+1}, \ldots, \mathbf{e}_M]$ be the eigenvector matrix of the noise subspace.
Key property: $\mathbf{a}(\omega_i) \perp \mathbf{E}_n$ for all $i = 1, \ldots, p$.
Proof: Because $\mathbf{a}(\omega_i)$ is a column of $\mathbf{A}$, it lies in the signal subspace $\text{span}(\mathbf{e}_1, \ldots, \mathbf{e}_p)$, which is orthogonal to the noise subspace.
Therefore, define the MUSIC pseudospectrum:
$$P_{\text{MUSIC}}(\omega) = \frac{1}{\mathbf{a}^H(\omega)\,\mathbf{E}_n\mathbf{E}_n^H\,\mathbf{a}(\omega)}$$At $\omega = \omega_i$, the denominator $\mathbf{a}^H\mathbf{E}_n\mathbf{E}_n^H\mathbf{a} = \|\mathbf{E}_n^H\mathbf{a}\|^2 \to 0$, so $P_{\text{MUSIC}} \to \infty$ — producing a sharp peak.
Note: $P_{\text{MUSIC}}$ is not a true power spectral density, just an indicator function ("pseudospectrum"). Peak locations are meaningful (= frequency estimates), but peak heights have no power interpretation. $\;\blacksquare$
How to Use: Five-Step Process
Step 1: Build autocorrelation matrix $\mathbf{R}$ (size $M \times M$)
From data $x[0], \ldots, x[N-1]$, construct a Hankel matrix and estimate $\hat{\mathbf{R}} = \frac{1}{N-M+1}\sum_{n=0}^{N-M}\mathbf{x}_n\mathbf{x}_n^H$, where $\mathbf{x}_n = [x[n], x[n+1], \ldots, x[n+M-1]]^T$.
Choosing $M$: $M$ must be greater than the number of signals $p$. Rule of thumb: $M \approx N/3$ to $N/2$. $M$ too small → matrix too small, poor resolution; $M$ too large → too few snapshots for estimating $\hat{\mathbf{R}}$, inaccurate.
Step 2: Eigendecomposition
$\hat{\mathbf{R}} = \mathbf{E}\boldsymbol{\Lambda}\mathbf{E}^H$. Observe the eigenvalues: the first $p$ are significantly larger than the rest.
Step 3: Estimate the number of signals $p$
Look for the "cliff" in eigenvalues: $\lambda_1 \geq \cdots \geq \lambda_p \gg \lambda_{p+1} \approx \cdots \approx \lambda_M$. Or use MDL (Minimum Description Length) / AIC criteria for automatic determination:
$\text{MDL}(k) = -(N-M+1)(M-k)\ln\frac{\prod_{i=k+1}^{M}\lambda_i^{1/(M-k)}}{\frac{1}{M-k}\sum_{i=k+1}^{M}\lambda_i} + \frac{1}{2}k(2M-k)\ln(N-M+1)$
Choose $k$ that minimizes MDL as $\hat{p}$.
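The MDL criterion above translates directly into code. In this sketch, `mdl_order` is an illustrative helper (not a library routine) and `n_snapshots` plays the role of $N-M+1$; the ratio inside the log is the geometric over the arithmetic mean of the $M-k$ smallest eigenvalues:

```python
import numpy as np

def mdl_order(eigvals, n_snapshots):
    """Estimate the number of signals p by minimizing the MDL criterion."""
    lam = np.sort(np.asarray(eigvals, float))[::-1]   # descending eigenvalues
    M = len(lam)
    scores = []
    for k in range(M):                     # hypothesized number of signals
        tail = lam[k:]                     # the M-k smallest eigenvalues
        geo = np.exp(np.mean(np.log(tail)))   # geometric mean
        ari = np.mean(tail)                   # arithmetic mean
        scores.append(-n_snapshots * (M - k) * np.log(geo / ari)
                      + 0.5 * k * (2 * M - k) * np.log(n_snapshots))
    return int(np.argmin(scores))
```

With a clear eigenvalue "cliff" (e.g. two large eigenvalues far above a flat noise floor), the minimum lands at $k = p$.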
Step 4: Construct noise subspace and scan pseudospectrum
$\mathbf{E}_n = [\mathbf{e}_{p+1}, \ldots, \mathbf{e}_M]$; densely sample the frequency axis to compute $P_{\text{MUSIC}}(\omega)$.
Step 5: Find peaks → frequency estimates
Peak locations of $P_{\text{MUSIC}}$ = estimated frequencies. Parabolic interpolation can further refine the estimates.
Concrete example: Two closely spaced sinusoids
- Signal: $x[n] = \sin(2\pi \cdot 100\,n/f_s) + \sin(2\pi \cdot 103\,n/f_s) + \text{noise}$
- Sampling rate $f_s = 1000\,\text{Hz}$, only $N = 64$ data points
- FFT bin = $1000/64 = 15.6\,\text{Hz}$ → completely unresolvable (3 Hz $\ll$ 15.6 Hz)
- MUSIC: choose $M = 16$, estimate $p = 2$
- Eigendecomposition → 2 large eigenvalues ($\gg \sigma^2$) → 14 small eigenvalues ($\approx \sigma^2$)
- Scan pseudospectrum → two clear, sharp peaks appear at 100 Hz and 103 Hz
- Successfully resolved! Resolution improved by $15.6/3 \approx 5$ times
ESPRIT: A More Efficient Alternative
Intuition: MUSIC needs to scan the entire frequency axis to search for peaks, which is computationally expensive. ESPRIT exploits a clever observation — if the first $M-1$ rows and last $M-1$ rows of the matrix are viewed as two "sub-arrays," the relationship between them is a rotation (phase shift), and the rotation amount = $e^{j\omega}$ directly gives the frequency!
ESPRIT Rotational Invariance
$$\mathbf{E}_{s2} = \mathbf{E}_{s1}\,\mathbf{\Phi}, \quad \mathbf{\Phi} = \text{diag}(e^{j\omega_1}, e^{j\omega_2}, \ldots, e^{j\omega_p})$$$\mathbf{E}_{s1}$, $\mathbf{E}_{s2}$: projections of the signal subspace onto the two sub-arrays
Implementation steps:
- Eigendecompose to get signal subspace $\mathbf{E}_s$ (same as MUSIC)
- Take $\mathbf{E}_{s1}$ = $\mathbf{E}_s$ with last row removed, $\mathbf{E}_{s2}$ = $\mathbf{E}_s$ with first row removed
- Compute $\mathbf{\Phi} = \mathbf{E}_{s1}^{\dagger}\mathbf{E}_{s2}$ (least squares / Total Least Squares)
- Eigenvalues of $\mathbf{\Phi}$: $\lambda_i = e^{j\omega_i}$ → $\omega_i = \angle\lambda_i$ → frequency $f_i = \omega_i f_s / (2\pi)$
| Property | MUSIC | ESPRIT |
|---|---|---|
| Output | Pseudospectrum (requires peak search) | Directly gives frequency values |
| Computational cost | $O(M^3) + O(M^2 \cdot N_{\text{scan}})$ | $O(M^3)$ (no search needed) |
| Array calibration | Requires precise calibration | Less sensitive |
| Resolution | Slightly higher (uses full noise subspace) | Slightly lower but still far exceeds FFT |
| Additional info | Pseudospectrum provides visualization | Only frequency values |
Application Scenarios
- Radar DOA estimation: Phased array radar with 8~64 antenna elements receives signals; MUSIC estimates the precise bearing angles of multiple targets. In practical systems, with a 64-element ULA (uniform linear array), MUSIC can resolve two targets separated by < 1 degree at SNR = 10 dB (conventional beamforming resolution is about 7 degrees).
- Wireless communication AoA positioning: 5G base stations use massive MIMO antenna arrays, employing ESPRIT/MUSIC to estimate the Angle of Arrival (AoA) of user devices, combining multi-base-station information for indoor positioning (accuracy < 1 m). ESPRIT is preferred in real-time systems due to its computational efficiency.
- Closely-spaced modal identification in vibration analysis: Mechanical structures (e.g., aircraft wings, bridges) have multiple vibration modes at natural frequencies, some extremely close (< 1 Hz apart). MUSIC can resolve these closely-spaced modes from short accelerometer data segments, used for structural health monitoring.
- Multiple pitch estimation in music signals: In a piano chord, the fundamental frequencies of multiple notes differ by less than one semitone (~6%), each with harmonics. MUSIC can precisely estimate the fundamental frequency of each note in a chord, serving as a tool for Automatic Music Transcription.
Pitfalls and Limitations
- Must know or estimate the number of signals $p$: This is the "Achilles' heel" of subspace methods. If $p$ is estimated incorrectly, the results are wrong. Overestimating $p$ → spurious peaks; underestimating $p$ → missed signals. The MDL criterion is reliable at high SNR, but prone to failure at low SNR.
- Correlated signals (coherent sources) cause failure: If two signals are fully correlated (e.g., reflections in multipath), $\mathbf{S}$ is rank-deficient, the signal subspace dimension is less than $p$, and MUSIC "leaks" some signals into the noise subspace. Solution: spatial smoothing — sacrificing some array aperture to restore the rank of $\mathbf{S}$.
- Computational cost $O(M^3)$: The eigendecomposition complexity. Already expensive at $M = 64$. Large-scale arrays ($M > 100$) require fast subspace tracking algorithms (e.g., PAST, GROUSE).
- Performance degrades sharply at low SNR: Subspace methods exhibit a "threshold effect" — when SNR falls below a threshold (typically 5~10 dB), performance degrades dramatically, with large estimation biases or completely wrong results. This is because noise and signal eigenvalues begin to overlap, and subspace separation fails.
- Only suitable for "few sinusoids + white noise" model: If the signal is broadband (e.g., speech) or the noise is colored (non-white), the assumptions of standard MUSIC/ESPRIT are violated and results are unreliable. Colored noise requires pre-whitening.
When Not to Use?
- Need a full PSD (not just discrete frequencies): MUSIC only estimates discrete frequency components, not a continuous PSD → use Welch or AR model
- Many signals and count unknown: More than $M/2$ signals cannot be handled (a fundamental limitation of subspace methods). In practice, $p > 5$~$8$ is already difficult
- Real-time processing with large matrices: $O(M^3)$ eigendecomposition may be too slow for embedded systems → consider ESPRIT (faster) or subspace tracking algorithms
- Need a robust "works anywhere" method: Subspace methods are sensitive to model assumptions (white noise, known signal count, uncorrelated) → Welch / Multitaper is safer
Interactive: MUSIC vs FFT
Two very closely spaced sinusoids (center frequency 100 Hz). Adjust frequency separation and SNR. When $\Delta f$ is much smaller than the FFT bin width, the FFT shows only one peak, but MUSIC can resolve two.
FFT bin width = 15.6 Hz. When $\Delta f < 15.6$ Hz, the FFT fundamentally cannot resolve them. Lower the SNR to observe the threshold effect.
References: [1] Schmidt, Multiple Emitter Location and Signal Parameter Estimation, IEEE Trans. AP, 1986 (original report 1979). [2] Roy & Kailath, ESPRIT — Estimation of Signal Parameters via Rotational Invariance Techniques, IEEE Trans. ASSP, 1989. [3] Stoica & Moses, Spectral Analysis of Signals, Ch.4, Pearson, 2005. [4] Van Trees, Optimum Array Processing, Part IV of Detection, Estimation, and Modulation Theory, Wiley, 2002.
✅ Quick Check
Q1: What information does MUSIC need to work?
Show answer
It needs to know (or estimate) the number of signals p, in order to distinguish the signal subspace from the noise subspace. MDL or AIC criteria are typically used to estimate p.
Q2: What happens if two signals are fully correlated (coherent)?
Show answer
It fails — correlated signals cause the autocorrelation matrix to become rank-deficient, preventing correct subspace separation. Spatial smoothing preprocessing is required.
3.5 Chirp-Z Transform (CZT)
A spectral magnifying glass — computing the DFT along any path in the Z-plane
Why does this matter? Because sometimes you only care about the details of a narrow frequency band (e.g., precisely measuring power grid frequency deviation). CZT lets you focus computational resources on that band, like a spectral magnifying glass.
Previously: Section 3.4's MUSIC can do super-resolution estimation, but is computationally expensive. If you just want to "zoom in" on a frequency band's details, there is a lighter tool —
Learning Objectives
- Define the CZT: a generalized DFT sampled along a spiral in the Z-plane
- Derive the Bluestein identity and $O(N\log N)$ computation method
- Understand the CZT's "frequency zoom-in" capability and its difference from zero-padding
- Distinguish "denser frequency sampling" from "higher frequency resolution"
One-Sentence Summary
CZT lets you compute only the frequency range you care about, like a magnifying glass on the spectrum.
Pain Point: I Only Care About a Small Frequency Range
You are doing power system frequency monitoring and only care about the tiny frequency deviation between 49.9~50.1 Hz. Sampling rate $f_s = 1\,\text{kHz}$, you collected 1 second of data ($N = 1000$). The FFT gives you the full spectrum from 0~500 Hz, each bin = 1 Hz, with only one or two points near 50 Hz — far too coarse to see a 0.01 Hz frequency deviation.
You could zero-pad to $N = 100000$ (100-second equivalent), making the bin width 0.01 Hz, but that means computing a 100K-point FFT, of which 99.96% of the computed results (0~49.9 Hz and 50.1~500 Hz) you don't need at all.
Is there a way to compute only the 49.9~50.1 Hz range, but with a dense 0.01 Hz spacing?
Origin
Lawrence Rabiner, Ronald Schafer, and Charles Rader (1969) at Bell Labs proposed the Chirp-Z Transform. Their key insight was: the DFT is actually equi-spaced sampling of the Z-transform on the unit circle; if you move the sampling points to any spiral in the Z-plane, you get a more flexible frequency analysis tool.
The paper title directly describes the method's core: "The Chirp z-Transform Algorithm." "Chirp" refers to the linear frequency-modulated signal used in the algorithm, because in the Bluestein identity that converts the DFT into a convolution, the kernel function is precisely a chirp.
Another important contribution by Rader was discovering that prime-length DFTs can be converted into convolutions (Rader's algorithm, 1968), while the CZT's Bluestein method is more general — it converts DFTs of any length into convolutions (no requirement for the length to be a power of 2 or prime).
Principle
Intuition: The DFT consists of $N$ equi-spaced samples of the Z-transform on the unit circle. CZT generalizes these samples to $M$ equi-spaced points on any spiral in the Z-plane. If the spiral covers only the small arc (frequency range) you are interested in, you get arbitrarily dense spectral samples in that range — and $M$ can be much larger or smaller than $N$.
CZT Definition
$$X(z_k) = \sum_{n=0}^{N-1}x[n]\,z_k^{-n}, \quad z_k = A\,W^{-k}, \; k = 0, 1, \ldots, M-1$$$A = A_0\,e^{j\theta_0}$ (starting point), $W = W_0\,e^{j\phi_0}$ (step) define the spiral path in the Z-plane
DFT is a special case of CZT: $A = 1$, $W = e^{-j2\pi/N}$, $M = N$ ($N$ equi-spaced points on the unit circle).
Frequency zoom-in: Simply set $A = e^{j2\pi f_1/f_s}$ (start frequency), $W = e^{-j2\pi(f_2-f_1)/(Mf_s)}$ (frequency step), and the $M$-point CZT computes only the $M$-point spectrum within $[f_1, f_2]$.
Expand: Bluestein identity and convolution implementation
Direct computation of CZT is $O(NM)$, but the Bluestein identity converts it into a convolution:
Key: $kn = \frac{1}{2}[k^2 + n^2 - (k-n)^2]$. Therefore:
$$W^{kn} = W^{k^2/2}\,W^{n^2/2}\,W^{-(k-n)^2/2}$$Substituting into the CZT definition:
$$X(z_k) = W^{k^2/2}\sum_{n=0}^{N-1}\underbrace{\left[x[n]\,A^{-n}\,W^{n^2/2}\right]}_{g[n]}\,\underbrace{W^{-(k-n)^2/2}}_{h[k-n]}$$The summation inside the brackets is a linear convolution of $g[n]$ and $h[n] = W^{-n^2/2}$!
Therefore, the CZT can be computed with three FFTs:
- Compute $g[n] = x[n]\,A^{-n}\,W^{n^2/2}$, $n = 0, \ldots, N-1$
- Compute $h[n] = W^{-n^2/2}$, $n = -(N-1), \ldots, M-1$
- FFT convolution: $y = g * h$ (zero-pad to $\geq N+M-1$, FFT → pointwise multiply → IFFT)
- $X(z_k) = W^{k^2/2}\,y[k]$, $k = 0, \ldots, M-1$
Total complexity: $O((N+M)\log(N+M))$, much faster than the direct $O(NM)$. $\;\blacksquare$
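The four-step recipe above can be sketched directly in NumPy. This `czt` is an illustrative implementation of the Bluestein identity, not a library routine (SciPy ≥ 1.8 ships its own `scipy.signal.czt`):

```python
import numpy as np

def czt(x, M, A, W):
    """Chirp-Z transform at z_k = A * W^{-k} via Bluestein's identity (three FFTs)."""
    x = np.asarray(x, complex)
    N = len(x)
    n = np.arange(N)
    k = np.arange(M)
    g = x * A**(-n) * W**(n**2 / 2)            # step 1: premultiplied sequence g[n]
    nh = np.arange(-(N - 1), M)                # step 2: chirp kernel h[n] = W^{-n^2/2}
    h = W**(-(nh**2) / 2)
    L = 1 << int(np.ceil(np.log2(N + M - 1)))  # FFT length >= N+M-1
    # step 3: linear convolution y = g * h via FFT -> pointwise multiply -> IFFT
    y = np.fft.ifft(np.fft.fft(g, L) * np.fft.fft(h, L))
    # output index k of g*h sits at position k + (N-1) of the convolution array
    return W**(k**2 / 2) * y[N - 1:N - 1 + M]  # step 4: postmultiply by W^{k^2/2}
```

The DFT special case is a quick sanity check: with $A = 1$, $W = e^{-j2\pi/N}$, $M = N$ the result matches `np.fft.fft`. For a frequency zoom over $[f_1, f_2]$, set `A = np.exp(2j*np.pi*f1/fs)` and `W = np.exp(-2j*np.pi*(f2-f1)/(M*fs))` as in the definition above.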
How to Use
Just three parameters
- Start frequency $f_1$: lower bound of your frequency band of interest
- End frequency $f_2$: upper bound of your frequency band of interest
- Number of sample points $M$: how many frequency points between $[f_1, f_2]$
Frequency spacing = $(f_2 - f_1)/M$. You can make this spacing arbitrarily small — but remember, this does not increase the true frequency resolution.
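For the power-grid scenario from the Pain Point, SciPy (≥ 1.8) packages this zoom as `scipy.signal.zoom_fft`; the sketch below assumes a 50.02 Hz tone and 1 second of data, so the true resolution is about 1 Hz while the sampling spacing is 0.001 Hz:

```python
import numpy as np
from scipy.signal import zoom_fft

fs = 1000
t = np.arange(0, 1.0, 1 / fs)                 # 1 second of data, N = 1000
x = np.sin(2 * np.pi * 50.02 * t)             # grid frequency deviated to 50.02 Hz

f1, f2, M = 49.5, 50.5, 1000                  # zoom band and number of points
X = zoom_fft(x, [f1, f2], m=M, fs=fs)         # CZT samples on [f1, f2)
f = f1 + np.arange(M) * (f2 - f1) / M         # frequency grid, 0.001 Hz spacing
f_peak = f[np.argmax(np.abs(X))]              # peak location ~ 50.02 Hz
```

The dense grid locates the peak of the (single) mainlobe very precisely; it does not let you separate two tones closer than $1/T \approx 1$ Hz.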
Application Scenarios
- Power system frequency monitoring: The grid nominal frequency is 50/60 Hz, but the actual frequency fluctuates between 49.95~50.05 Hz. Monitoring this tiny deviation is critical for grid stability. CZT can achieve 0.001 Hz frequency measurement precision on 1 second of data (a direct FFT would need 1000 seconds to achieve the same bin density). IEC 61000-4-30 Class A power quality analyzers actually use CZT-like techniques.
- Musical instrument tuner: A4 = 440 Hz, A4# = 466.16 Hz, semitone difference = 26.16 Hz ($\approx$ 6%). Tuning requires precision < 1 cent (0.058%, i.e., 0.26 Hz). For 44.1 kHz sampled audio of 0.1 seconds ($N = 4410$), FFT bin = 10 Hz, completely insufficient. CZT zoom to 430~450 Hz, $M = 2000$ points → spacing 0.01 Hz → easily achieves 0.1 cent precision.
- Precision vibration analysis: Shaft vibration of rotating machinery (e.g., turbine generator, 3000 RPM = 50 Hz). Need to precisely track small changes in amplitude and phase of 1X (50 Hz) and 2X (100 Hz) with load/temperature. CZT focuses on the two narrow bands 49~51 Hz and 99~101 Hz, more flexible than order tracking.
Pitfalls and Limitations
- CZT does not increase true frequency resolution: This is the most important and most easily misunderstood point. Frequency resolution is still limited by the observation time $T$: $\Delta f_{\text{resolution}} \approx 1/T$. CZT only provides denser frequency sampling (like reading with finer markings on a ruler), but cannot resolve two frequencies spaced closer than $1/T$. This is similar to zero-padding, but CZT can be applied to only the band you care about, making it more flexible and less computationally expensive.
- Leave margin when choosing the $[f_1, f_2]$ range: If the zoom range is too narrow, you may truncate the mainlobe of the target frequency, causing sidelobe artifacts. Typically leave 2~3 mainlobe widths of margin at each end.
- Windowing is still required: CZT does not eliminate spectral leakage. Data should still be windowed before CZT.
- Numerical stability when $|W_0| \neq 1$ (spiral rather than arc): If the sampling path deviates too far from the unit circle, $z_k^{-n}$ grows or decays exponentially, causing numerical issues. Sampling on the unit circle ($|A_0| = |W_0| = 1$) is safest.
When Not to Use?
- Need full-range spectrum: If you need to see the complete spectrum from 0 to $f_s/2$, a direct FFT is faster and simpler
- Need to truly increase resolution (resolve close frequencies): CZT is a magnifying glass, not a microscope → use MUSIC / ESPRIT (Section 3.4) or collect longer data
- $M$ and $N$ are both large and similar: CZT's three FFTs plus preprocessing is slower than a single direct FFT → only worthwhile when the zoom range is significantly smaller than the full range
References: [1] Rabiner, Schafer & Rader, The Chirp z-Transform Algorithm, IEEE Trans. Audio Electroacoustics, 1969. [2] Bluestein, A Linear Filtering Approach to the Computation of Discrete Fourier Transform, IEEE Trans. AU, 1970. [3] Oppenheim & Schafer, Discrete-Time Signal Processing, Section 9.6.
Interactive: Chirp-Z Frequency Magnifying Glass
Two sinusoids only 1 Hz apart (99.5 Hz + 100.5 Hz). The FFT's Δf≈3.9 Hz cannot resolve them; the CZT can zoom into any frequency band and clearly separate the two peaks.
4.1 Hilbert Transform
Mathematical foundation for constructing the analytic signal — extracting amplitude envelope and instantaneous phase from real signals
Why does this matter? Because envelope detection is the core operation in communication demodulation (AM) and mechanical fault diagnosis (bearings). The Hilbert transform is the cleanest, most mathematically grounded method for extracting the envelope.
Previously: Part III taught various spectral estimation methods. But some applications need more than just the spectrum — they also need the "envelope," i.e., how the signal amplitude varies over time. The Hilbert transform is the mathematical tool for extracting the envelope.
Learning Objectives
- Define the Hilbert transform and its frequency-domain representation $-j\,\text{sgn}(\omega)$
- Understand the physical meaning of the analytic signal: eliminating negative frequency redundancy
- Master the three-step FFT-based Hilbert transform implementation
- Recognize the limitations and correct usage of the Hilbert transform
One-Sentence Summary
The Hilbert transform converts a real signal into a complex signal (the analytic signal), allowing you to extract the amplitude envelope and instantaneous frequency.
Pain Point: How to Extract What Is "Hidden Inside the Carrier"?
Scenario 1: AM radio. A station modulates speech (20~4000 Hz) onto a 1 MHz carrier and broadcasts it. How does the radio "strip" the speech off the carrier? The speech is the carrier's amplitude envelope, but how do you extract the envelope from a real-valued signal?
Scenario 2: Bearing fault detection. A bearing outer race has a small defect; each time a rolling element passes over the defect it produces an impact, and these impacts excite the bearing housing's high-frequency resonance (2~5 kHz). The time-domain waveform shows a series of impact pulses "modulated" by the 2~5 kHz resonance. What you want to find is the repetition frequency of the impacts (BPFO, about 87 Hz), but it is hidden within the high-frequency resonance. You need to extract the envelope first, then FFT the envelope to find the BPFO.
In both scenarios, envelope detection is the core operation, and the Hilbert transform is the cleanest envelope detection tool.
Origin
David Hilbert (1905) introduced a special class of singular integral transforms while studying complex analysis and integral equations. This purely mathematical tool had no connection to engineering at the time.
Dennis Gabor (1946) was the key figure who brought the Hilbert transform into signal processing. In his landmark paper "Theory of Communication" (published in the Journal of the IEE), Gabor proposed the concept of the "analytic signal": pairing a real signal $x(t)$ with its Hilbert transform $\hat{x}(t)$ as the imaginary part to form a complex signal $z(t) = x(t) + j\hat{x}(t)$.
Gabor's motivation was communication theory — he wanted to give rigorous mathematical definitions for a signal's "instantaneous frequency" and "instantaneous amplitude." The analytic signal provided this framework. Incidentally, the same paper also introduced what would later be called the Gabor transform (a special case of the short-time Fourier transform).
Gabor later received the 1971 Nobel Prize in Physics for inventing holography.
Principle
Intuition: A real signal's spectrum is conjugate-symmetric ($X(-\omega) = X^*(\omega)$), meaning positive and negative frequencies carry exactly the same information. The analytic signal removes the negative frequencies and doubles the positive ones — no information is lost, but the representation is more concise. Moreover, after removing negative frequencies, the signal becomes complex, allowing direct extraction of the envelope and instantaneous phase via magnitude and phase angle.
Hilbert Transform (Time-Domain Definition)
$$\hat{x}(t) = \mathcal{H}\{x(t)\} = \frac{1}{\pi}\,\text{p.v.}\!\int_{-\infty}^{\infty}\frac{x(\tau)}{t - \tau}\,d\tau = x(t) * \frac{1}{\pi t}$$p.v. = Cauchy principal value (avoiding the singularity at $\tau = t$)
Hilbert Transform (Frequency-Domain Representation)
$$\hat{X}(\omega) = -j\,\text{sgn}(\omega)\cdot X(\omega) = \begin{cases}-jX(\omega), & \omega > 0 \\ 0, & \omega = 0 \\ jX(\omega), & \omega < 0\end{cases}$$Effect: Positive frequency components are phase-shifted by $-90°$, negative frequency components by $+90°$, with magnitude unchanged. It is an allpass phase shifter.
Expand: Why does $-j\,\text{sgn}(\omega)$ equal a 90-degree phase shift?
$-j = e^{-j\pi/2}$, so multiplying by $-j$ subtracts 90 degrees from the phase.
For positive frequency components $X(\omega)$ ($\omega > 0$): $\hat{X}(\omega) = -jX(\omega)$, phase shift $-90°$.
For negative frequency components ($\omega < 0$): $\text{sgn}(\omega) = -1$, so $\hat{X}(\omega) = jX(\omega)$, phase shift $+90°$.
Hilbert transform of $\cos(\omega_0 t)$:
$$\mathcal{H}\{\cos(\omega_0 t)\} = \sin(\omega_0 t)$$Because each positive frequency component of $\cos$ is phase-shifted by $-90°$: $\cos(\omega_0 t - 90°) = \sin(\omega_0 t)$. $\;\blacksquare$
Analytic Signal
Definition
$$z(t) = x(t) + j\,\hat{x}(t)$$Spectrum of the analytic signal:
$$Z(\omega) = X(\omega) + j\hat{X}(\omega) = \begin{cases}2X(\omega), & \omega > 0 \\ X(\omega), & \omega = 0 \\ 0, & \omega < 0\end{cases}$$Only positive frequencies — negative frequencies are completely eliminated. This is the meaning of "analytic": in complex analysis, an analytic function's Fourier transform exists only in a half-plane.
Concrete example: $x(t) = A\cos(\omega_0 t + \phi)$
$\hat{x}(t) = A\sin(\omega_0 t + \phi)$
$z(t) = Ae^{j(\omega_0 t + \phi)}$
$|z(t)| = A$ (constant envelope), $\angle z(t) = \omega_0 t + \phi$ (linear phase).
FFT-based Hilbert Transform: Three Steps
This is the most common implementation in practice (and what MATLAB's hilbert() and SciPy's scipy.signal.hilbert() do internally):
- Step 1: Compute the $N$-point FFT: $X[k] = \text{FFT}\{x[n]\}$
- Step 2: Zero out the negative frequencies and double the positive ones: keep $X[0]$ (and $X[N/2]$ if $N$ is even) unchanged; multiply $X[k]$ by 2 for $1 \leq k < N/2$; set $X[k] = 0$ for $k > N/2$
- Step 3: Inverse FFT → analytic signal $z[n]$; then envelope $A[n] = |z[n]|$, instantaneous phase $\phi[n] = \angle z[n]$, and $\hat{x}[n] = \text{Im}\{z[n]\}$
How to Use
Envelope Detection Workflow
- (Critical!) Bandpass filter: First restrict the signal to the narrow frequency band of interest. Computing the Hilbert envelope without filtering first typically yields physically meaningless results.
- Compute Hilbert transform / analytic signal: Use the FFT three-step method above.
- Take magnitude = envelope: $A(t) = |z(t)| = \sqrt{x^2(t) + \hat{x}^2(t)}$
Concrete example: AM demodulation
$x(t) = [1 + 0.8\cos(2\pi \cdot 5\,t)] \cdot \cos(2\pi \cdot 1000\,t)$
Carrier 1000 Hz, modulating wave 5 Hz (modulation depth 80%)
- Directly compute the Hilbert envelope $|z(t)|$
- The envelope perfectly recovers $1 + 0.8\cos(2\pi \cdot 5\,t)$ — the modulating waveform
- This is the principle of AM envelope detection!
Python Example: Extracting the Envelope with scipy.signal.hilbert
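A minimal sketch of the AM example above (the sampling rate and duration are assumed values, not from the original problem statement):

```python
import numpy as np
from scipy.signal import hilbert

fs = 20000                                    # assumed sampling rate, Hz
t = np.arange(int(0.5 * fs)) / fs             # 0.5 s of signal
m = 1 + 0.8 * np.cos(2 * np.pi * 5 * t)       # modulating wave: 5 Hz, depth 0.8
x = m * np.cos(2 * np.pi * 1000 * t)          # AM signal, 1 kHz carrier

z = hilbert(x)                                # analytic signal via FFT method
envelope = np.abs(z)                          # A(t) = |z(t)|

# away from the edges (see the edge-effect pitfall below),
# the envelope recovers the modulating waveform
mid = slice(fs // 10, -fs // 10)
print(np.max(np.abs(envelope[mid] - m[mid]))) # small residual error
```

Trimming the first and last 0.05 s sidesteps the FFT edge artifacts; in that interior region the envelope matches $1 + 0.8\cos(2\pi\cdot 5t)$ closely.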
Application Scenarios
- Communication demodulation (AM / SSB): AM envelope detection as described above. SSB (single-sideband modulation) uses the Hilbert transform to eliminate half the bandwidth: $x_{\text{SSB}}(t) = x(t)\cos(\omega_c t) \mp \hat{x}(t)\sin(\omega_c t)$ (upper/lower sideband). SSB bandwidth is half that of AM, and is the standard modulation scheme for amateur radio and some military communications.
- Bearing fault envelope spectrum analysis: Accelerometer measures bearing vibration → bandpass filter (focus on 2~5 kHz resonance band) → Hilbert envelope → envelope FFT → find BPFO = 87.3 Hz and its 2X (174.6 Hz), 3X (261.9 Hz) harmonics in the envelope spectrum. ISO 13373-3 and major vibration analysis software (e.g., SKF Microlog, B&K Pulse) all use this as their standard procedure.
- Speech pitch tracking: Bandpass filter speech signal to the fundamental frequency range (80~400 Hz) → Hilbert envelope → autocorrelation of the envelope → first non-zero peak = fundamental period $T_0$ → fundamental frequency $F_0 = 1/T_0$.
- Seismic wave analysis: The envelope (instantaneous amplitude) of seismic signals is used to determine P-wave and S-wave arrival times, as well as epicenter distance estimation.
Pitfalls and Limitations
- Hilbert envelope of broadband signals has no physical meaning: If the signal is not narrowband (i.e., cannot be written as $x(t) \approx A(t)\cos[\omega_c t + \phi(t)]$), the envelope $|z(t)|$ will fluctuate wildly and be difficult to interpret. You must bandpass filter first to make the signal narrowband before computing the Hilbert envelope. This is the most important usage guideline.
- Edge effect: The FFT performs circular convolution, producing artifacts at the signal's beginning and end. Solutions: (a) extend the data at both ends (mirroring or zero-padding), then trim the middle after computation; (b) use overlap-add segmented processing.
- Discrete-time approximation: The continuous Hilbert transform is non-causal ($1/(\pi t)$ has values at $t < 0$). The FFT implementation is a finite-length approximation. When the signal bandwidth approaches the Nyquist frequency, the approximation quality degrades.
- DC component issue: If the signal has a DC offset, the DC part remains in the real part but not the imaginary part after the Hilbert transform, causing a bias in the envelope estimate. Remove the mean before computing the Hilbert transform.
When Not to Use?
- Signal has multiple components with different carriers: The Hilbert envelope will be a mixed envelope of all components, unable to separate them → bandpass filter to separate components first, or use EMD/HHT (Section 5.6)
- Need time-frequency analysis (not just the envelope): Hilbert only provides a one-dimensional envelope and instantaneous frequency, not a full time-frequency distribution → use STFT (Section 5.1) or CWT (Section 5.4)
- Envelope detection in broadband noise: Filter first, then Hilbert. Without filtering, the envelope detector will track the random envelope of the noise, yielding no useful information
Interactive: Envelope Detection
AM modulated signal $x(t) = [1 + m\cos(2\pi f_m t)]\cos(2\pi \cdot 1000\,t)$. The Hilbert envelope (orange) perfectly tracks the modulating waveform. Adjust modulation frequency $f_m$ to observe the effect.
References: [1] Gabor, Theory of Communication, J. IEE, 1946. [2] Hahn, Hilbert Transforms in Signal Processing, Artech House, 1996. [3] Marple, Computing the Discrete-Time Analytic Signal via FFT, IEEE Trans. SP, 1999. [4] Feldman, Hilbert Transform Applications in Mechanical Vibration, Wiley, 2011.
✅ Quick Check
Q1: What does the Hilbert transform do in the frequency domain?
Show answer
Positive frequencies are phase-shifted by -90 degrees, negative frequencies by +90 degrees, with magnitude unchanged. Equivalent to multiplying by -j·sgn(ω).
Q2: Why must you bandpass filter before computing the Hilbert envelope?
Show answer
Because the envelope of a broadband signal has no physical meaning. Bandpass filtering makes the signal narrowband, so the envelope can correctly reflect modulation characteristics.
4.2 Envelope & Instantaneous Frequency
Extracting time-varying amplitude and frequency from the analytic signal — a core tool for mechanical fault diagnosis
Why does this matter? Because for rotating machinery problems like bearing faults and gear defects, the fault signatures are not directly visible in the spectrum — they are hidden in the "envelope" of high-frequency resonances. Envelope spectrum analysis is the standard tool for industrial predictive maintenance.
Previously: Section 4.1 introduced the mathematics of the Hilbert transform. Now let us look at its most important application: extracting the envelope and instantaneous frequency from the analytic signal, which is the key technique for bearing fault diagnosis.
Learning Objectives
- Derive envelope and instantaneous frequency from the polar representation of the analytic signal
- Understand the limitation that instantaneous frequency has physical meaning only for narrowband signals
- Master the complete 6-step bearing fault envelope spectrum analysis workflow
- Compare Hilbert envelope with traditional rectification + low-pass filtering methods
One-Sentence Summary
The magnitude of the analytic signal is the envelope, and the derivative of its phase is the instantaneous frequency — letting you track how a signal's amplitude and frequency change over time.
Pain Point
"I want to know the frequency of this chirp signal at every instant." A linear chirp sweeps from 100 Hz to 1000 Hz, but the standard FFT only tells you "the signal contains components from 100~1000 Hz" — it does not tell you which time instant corresponds to which frequency.
"What is the repetition frequency of the bearing impacts?" The vibration signal measured by the accelerometer looks like a blob of high-frequency noise — the impact pattern is invisible to the eye. But if you can extract the envelope, the impact pattern emerges, and then the envelope spectrum reveals the characteristic frequency.
Origin
The concept of instantaneous frequency emerged naturally within Gabor's (1946) framework: the time derivative of the analytic signal $z(t) = A(t)e^{j\phi(t)}$'s phase, $\frac{1}{2\pi}\frac{d\phi}{dt}$, is the instantaneous frequency. However, whether "instantaneous frequency" has physical meaning sparked a long-standing academic debate.
Ville (1948) rigorously proved that: for narrowband signals, Gabor's instantaneous frequency equals the conditional expectation (first moment of frequency) of the Wigner-Ville distribution, thus having clear physical meaning. But for broadband signals, the instantaneous frequency can be negative or even infinite — losing the intuitive meaning of frequency.
Envelope spectrum analysis as a mechanical fault diagnosis tool was systematically developed by Robert B. Randall and others in the 1980s~1990s, becoming the international standard method for rotating machinery condition monitoring (ISO 13373, ISO 10816 series).
Principle
Write the analytic signal in polar form:
$z(t) = A(t)\,e^{j\phi(t)}$
$$A(t) = |z(t)| = \sqrt{x^2(t) + \hat{x}^2(t)} \quad \text{(instantaneous amplitude / envelope)}$$ $$\phi(t) = \arg[z(t)] = \arctan\frac{\hat{x}(t)}{x(t)} \quad \text{(instantaneous phase)}$$ $$f_i(t) = \frac{1}{2\pi}\frac{d\phi}{dt} \quad \text{(instantaneous frequency)}$$Intuition: Imagine the signal as a rotating vector (phasor). At each instant, the vector's length = envelope $A(t)$, the vector's angle = phase $\phi(t)$, the vector's rotation speed = instantaneous angular frequency $2\pi f_i(t)$.
Expand: Computing discrete-time instantaneous frequency
In discrete time, the phase is $\phi[n] = \arctan(\hat{x}[n]/x[n])$. The instantaneous frequency is approximated by differencing:
$$f_i[n] = \frac{f_s}{2\pi}\,\Delta\phi[n] = \frac{f_s}{2\pi}\left(\phi[n] - \phi[n-1]\right)$$Note: The output of $\arctan$ is in the range $(-\pi, \pi]$, and the phase difference between adjacent samples may "jump" by $\pm 2\pi$ (phase wrapping). You must perform phase unwrapping before differencing:
$$\Delta\phi_{\text{unwrapped}}[n] = \text{wrap}(\phi[n] - \phi[n-1]) = \Delta\phi[n] - 2\pi\,\text{round}\!\left(\frac{\Delta\phi[n]}{2\pi}\right)$$Or a more robust method — compute directly from the analytic signal's real and imaginary parts:
$$f_i[n] = \frac{f_s}{2\pi}\,\arg\!\left(z[n]\,z^{*}[n-1]\right)$$Since $\arg(\cdot)$ already returns a value in $(-\pi, \pi]$, no explicit unwrapping step is needed. $\blacksquare$
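Both discrete-time estimators can be sketched on the linear chirp from the Pain Point above (sampling rate and sweep range are assumed values):

```python
import numpy as np
from scipy.signal import hilbert

fs = 8000
t = np.arange(fs) / fs                                      # 1 s of signal
f0, f1 = 100.0, 1000.0
x = np.cos(2 * np.pi * (f0 * t + 0.5 * (f1 - f0) * t**2))   # chirp 100 -> 1000 Hz

z = hilbert(x)                                              # analytic signal

# method 1: unwrap the phase, then difference
fi_unwrap = fs / (2 * np.pi) * np.diff(np.unwrap(np.angle(z)))

# method 2: angle of z[n] z*[n-1] -- lands in (-pi, pi], no unwrapping needed
fi_prod = fs / (2 * np.pi) * np.angle(z[1:] * np.conj(z[:-1]))

# both track the true instantaneous frequency f0 + (f1 - f0) t
f_true = f0 + (f1 - f0) * t[1:]
mid = slice(800, -800)                                      # skip edge artifacts
print(np.max(np.abs(fi_prod[mid] - f_true[mid])))           # within a few Hz
```

The residual error away from the edges comes from the finite-difference approximation and the FFT-based Hilbert edge ripple, both small for this narrow sweep rate.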
How to Use: Complete Bearing Fault Envelope Spectrum Analysis Workflow
This is one of the most important spectral analysis techniques in rotating machinery condition monitoring. Below is the complete 6-step industrial workflow with specific values:
Step 1: Acquire accelerometer data
Sampling rate $f_s = 25.6\,\text{kHz}$ (common vibration analysis sampling rate, covering up to 10 kHz). Acquire 2 seconds of data → $N = 51200$ points.
Step 2: Bandpass filter (focus on resonance band)
This is the most critical step. Bearing impacts excite structural resonances, with energy concentrated in a high-frequency band (typically 2~10 kHz, depending on the bearing and structure).
Design a bandpass filter with passband 2~5 kHz (center frequency 3.5 kHz, bandwidth 3 kHz), e.g. a 4th-order Butterworth. Spectral kurtosis can be used to automatically select the optimal filter band (Antoni 2006).
Step 3: Hilbert envelope detection
$A[n] = |z[n]| = \sqrt{x_{\text{filtered}}^2[n] + \hat{x}_{\text{filtered}}^2[n]}$
Result: a series of pulses, each corresponding to a rolling element passing over the defect. Pulse spacing = $1/\text{BPFO}$.
Step 4: Low-pass filter the envelope (remove carrier residuals)
The envelope may still contain some high-frequency residuals (from the carrier). Use a low-pass filter with cutoff frequency $\approx 500\,\text{Hz}$ (well above the expected fault frequency, but well below the carrier frequency).
Step 5: Envelope FFT (envelope spectrum)
FFT the low-pass filtered envelope (using a Hann window). Frequency resolution $\Delta f = f_s/N = 25600/51200 = 0.5\,\text{Hz}$ (sufficient to resolve harmonics of the fault frequency).
Step 6: Search for characteristic frequencies in the envelope spectrum
Bearing fault characteristic frequencies (SKF 6205 bearing at 1797 RPM as example):
| Fault Type | Characteristic Frequency | Value (Hz) | Pattern in Envelope Spectrum |
|---|---|---|---|
| Outer race (BPFO) | $n_b f_r(1-d/D)/2$ | 107.4 | 107.4, 214.7, 322.1 Hz |
| Inner race (BPFI) | $n_b f_r(1+d/D)/2$ | 162.2 | 162.2 Hz $\pm f_r$ sidebands |
| Rolling element (BSF) | $Df_r(1-(d/D)^2)/(2d)$ | 70.6 | $2\times$ BSF + sidebands |
$n_b$=9 (number of rolling elements), $f_r$=29.95 Hz (rotational frequency), $d/D$=0.2034 (rolling element diameter / pitch circle diameter)
Interpretation guidelines: Seeing BPFO and its 2X, 3X harmonics in the envelope spectrum → outer race fault. Seeing BPFI with $\pm f_r$ sidebands → inner race fault (the inner race rotates with the shaft; the fault entering and leaving the load zone produces modulation). The number and relative height of harmonics indicate fault severity.
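The 6-step workflow can be condensed into a short sketch on a synthetic signal. All values here (a made-up fault frequency of 100 Hz, a 3.5 kHz resonance, the noise level, the decay constant) are chosen for illustration only; step 4 is folded into restricting the envelope-spectrum search band below 500 Hz:

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert, get_window

fs, T = 25600, 2.0                       # step 1: 25.6 kHz, 2 s -> N = 51200
t = np.arange(int(fs * T)) / fs

# synthetic fault signal: an impact every 1/100 s, each exciting a decaying
# 3.5 kHz structural resonance, buried in broadband noise
f_fault, f_res = 100.0, 3500.0
x = 0.5 * np.random.default_rng(0).standard_normal(len(t))
for t0 in np.arange(0, T, 1 / f_fault):
    m = t >= t0
    x[m] += np.exp(-800 * (t[m] - t0)) * np.sin(2 * np.pi * f_res * (t[m] - t0))

# step 2: 4th-order Butterworth bandpass on the resonance band (2-5 kHz)
b, a = butter(4, [2000, 5000], btype="bandpass", fs=fs)
xf = filtfilt(b, a, x)

# step 3: Hilbert envelope
env = np.abs(hilbert(xf))

# step 5: envelope FFT with a Hann window; Delta f = 1/T = 0.5 Hz
env = env - env.mean()                   # remove DC before the FFT
E = np.abs(np.fft.rfft(env * get_window("hann", len(env))))
freqs = np.fft.rfftfreq(len(env), 1 / fs)

# step 6: the fault frequency should dominate the envelope spectrum
band = (freqs > 10) & (freqs < 500)
peak = freqs[band][np.argmax(E[band])]
print(peak)                              # expected near f_fault = 100 Hz
```

On a real measurement, the peak search would instead target the computed BPFO/BPFI/BSF values and their harmonics.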
Application Scenarios
- Rotating machinery condition monitoring: Wind turbine gearbox and generator bearings (bearing replacement cost per turbine: $150,000~$300,000 + downtime losses). Hourly automatic envelope spectrum analysis, trending BPFO/BPFI energy. Early warning can reduce downtime from weeks to planned days. Globally adopted by wind farms (e.g., Bruel & Kjaer Vibro, SKF Enlight).
- Speech pitch tracking: Bandpass speech signal to 50~500 Hz → Hilbert envelope → autocorrelation or envelope spectrum → fundamental frequency $F_0$. Instantaneous frequency is more direct: bandpass near the fundamental → instantaneous frequency $f_i(t)$ gives real-time intonation changes. Used for prosody analysis in speech synthesis and singing pitch detection.
- Ultrasonic non-destructive testing (NDT): Ultrasonic pulses (center frequency 5 MHz) reflect inside materials. Envelope detection extracts the amplitude variation of echoes with depth (A-scan display). Envelope peak location = defect depth, peak magnitude ∝ defect reflection coefficient. Standard inspection method in aerospace, nuclear, and petrochemical industries.
Pitfalls and Limitations
- Instantaneous frequency has physical meaning only for narrowband signals: This limitation is worth repeating. If the signal's bandwidth is comparable to its center frequency (i.e., $BW \approx f_c$), the instantaneous frequency will exhibit rapid fluctuations or even negative values, completely losing any "frequency" meaning. Rule of thumb: instantaneous frequency is reliable only when $BW / f_c < 0.3$.
- Computing the envelope without bandpass filtering first → meaningless results: If the signal contains components in multiple frequency bands (e.g., shaft frequency, mesh frequency, and resonance frequency simultaneously in bearing vibration), the Hilbert envelope will be a mixture of all components, unable to isolate the desired fault signature.
- Phase unwrapping difficulties: In low-SNR regions or when the envelope is near zero, the phase estimate jumps wildly, causing spike-like false values in instantaneous frequency. Countermeasures: (a) low-pass filter the instantaneous frequency; (b) only trust instantaneous frequency when the envelope value is sufficiently large.
- Hilbert envelope vs. rectification + low-pass: Traditional "rectification + low-pass filtering" can also do envelope detection, but the Hilbert method's advantage is that no low-pass cutoff frequency needs to be chosen (the envelope is generated automatically), and it provides more accurate envelope estimates for narrowband signals. The drawback is that the FFT method requires the entire data segment (non-causal), making it unsuitable for real-time processing.
When Not to Use?
- Need a full time-frequency distribution (not just an envelope curve): Use STFT spectrogram (Section 5.1) or CWT scalogram (Section 5.4)
- Signal is highly non-stationary with multiple time-varying components: Use EMD / HHT (Section 5.6), which adaptively decomposes into multiple IMFs, each then analyzed with Hilbert envelope and instantaneous frequency
- Real-time (causal) envelope detection: The Hilbert FFT method is non-causal (requires the entire data segment). Real-time systems can use FIR Hilbert filters (with delay) or simple rectification + low-pass filtering
References: [1] Gabor, Theory of Communication, J. IEE, 1946. [2] Randall & Antoni, Rolling Element Bearing Diagnostics — A Tutorial, Mech. Sys. Sig. Proc., 2011. [3] Antoni, The Spectral Kurtosis: A Useful Tool for Characterising Non-Stationary Signals, Mech. Sys. Sig. Proc., 2006. [4] Boashash, Estimating and Interpreting the Instantaneous Frequency of a Signal, Proc. IEEE, 1992.
📝 Worked Example
CNC spindle bearing SKF 6205 (n=9, d=7.94mm, D=38.5mm), speed 3600 RPM. (a) Compute BPFO. (b) How to choose the bandpass filter band? (c) What frequency resolution is needed for the envelope spectrum?
Show solution
(a) fr=60Hz, BPFO = (9/2)×60×(1−7.94/38.5) ≈ 214.3 Hz
(b) Choose the high-frequency resonance band, typically 2-8 kHz (determined by the measured frequency response function)
(c) To resolve BPFO≈214 Hz and its 2x=428 Hz harmonic, Δf < 5 Hz is needed → observation time ≥ 1/Δf = 0.2 seconds
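The arithmetic in part (a) can be checked directly:

```python
# BPFO = (n_b / 2) * f_r * (1 - d/D), assuming zero contact angle
n_b, d, D = 9, 7.94, 38.5        # SKF 6205 geometry from the problem (mm)
f_r = 3600 / 60                  # 3600 RPM -> 60 Hz shaft frequency
bpfo = (n_b / 2) * f_r * (1 - d / D)
print(round(bpfo, 1))            # -> 214.3 Hz; 2x harmonic at ~428.6 Hz
```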
Interactive: Bearing Fault Envelope Spectrum Analysis
Periodic impacts from bearing faults excite structural resonances, but clear fault frequencies are not visible in the raw FFT. Through envelope analysis (bandpass → Hilbert envelope → envelope spectrum), the hidden BPFO modulation frequency can be extracted.
4.3 Cepstrum Analysis
The spectrum of the spectrum — discovering hidden periodic structures in the "quefrency" domain
Why does this matter? Because when the spectrum contains periodic patterns (harmonic families, sideband families, echoes), the cepstrum can reveal that periodicity at a glance — it is a classic tool for speech pitch detection and gearbox analysis.
Previously: Section 4.2 used envelope analysis to find periodic impacts in the time domain. But sometimes the periodicity is in the frequency domain rather than the time domain — the spectrum has a set of equi-spaced peaks (harmonic families, sideband families). The cepstrum is the tool for analyzing "periodicity in the spectrum."
Learning Objectives
- Understand the cepstrum definition: real/power cepstrum vs. complex cepstrum
- Establish the relationship between quefrency-axis peaks and periodic structures in the spectrum
- Master cepstrum applications in speech pitch detection and gearbox fault analysis
- Recognize numerical issues arising from logarithmic operations and phase unwrapping
One-Sentence Summary
Cepstrum = the spectrum of the spectrum. When the spectrum contains periodic patterns (such as equi-spaced sidebands or harmonic families), the cepstrum helps you find that period.
Pain Point: Periodic Patterns in the Spectrum
Scenario 1: Gearbox fault. The vibration spectrum from gear meshing is centered on the Gear Mesh Frequency (GMF), with an entire family of equi-spaced sidebands on both sides, spaced at the shaft rotation frequency. For example, GMF = 600 Hz, rotation speed = 30 Hz — you see peaks at 510, 540, 570, 600, 630, 660, 690 Hz. The human eye can see they are equi-spaced (30 Hz apart), but automatically detecting "what is the spacing of these peaks" algorithmically is not easy.
Scenario 2: Speech pitch detection. The speech signal's spectrum has a set of harmonics: $F_0, 2F_0, 3F_0, \ldots$, where $F_0$ is the fundamental frequency (male $\approx$ 100 Hz, female $\approx$ 200 Hz). These harmonics form an equi-spaced pattern in the spectrum, with spacing = $F_0$. You need a method to automatically find this equi-spacing.
The cepstrum is designed precisely for this: it applies another Fourier transform to the spectrum, converting "periodic patterns in the spectrum" into "peaks in the cepstrum."
Origin
B.P. Bogert, M.J.R. Healy, and the renowned statistician John W. Tukey (1963) proposed the cepstrum during a seismic wave analysis study at Bell Labs. Their original motivation was to detect echoes in seismic signals: seismic waves reflected by geological layers produce delayed copies. In the spectrum, this manifests as periodic ripples, with the ripple frequency = the reciprocal of the echo delay.
Tukey, in his characteristic humorous style, reversed the letters of all related terms:
| Original | Reversed | Meaning |
|---|---|---|
| spectrum | cepstrum | Spectrum of the spectrum |
| frequency | quefrency | Cepstrum horizontal axis (unit: time) |
| harmonics | rahmonics | "Harmonics" in the cepstrum |
| filtering | liftering | Filtering operation in the cepstrum domain |
The paper title itself is full of Tukey's style: "The Quefrency Alanysis of Time Series for Echoes: Cepstrum, Pseudo-Autocovariance, Cross-Cepstrum, and Saphe Cracking". Yes, even "alanysis" is an intentional letter-reversal of "analysis."
The cepstrum later found its broadest applications in speech processing and mechanical fault diagnosis. In speech processing, it is the foundation of MFCC (Mel-Frequency Cepstral Coefficients) — one of the most important features in speech recognition.
Principle
Intuition: If the spectrum has equi-spaced peaks (spacing $\Delta f$), it is as if the spectrum has a "frequency" $= \Delta f$. Applying another FFT to the spectrum reveals this "frequency" — it appears as a peak at quefrency = $1/\Delta f$.
Real/Power Cepstrum
$$c[n] = \text{IFFT}\left\{\log\left|X[k]\right|\right\} = \text{IFFT}\left\{\log\left|\text{FFT}\{x[n]\}\right|\right\}$$Complex Cepstrum
$$\hat{c}[n] = \text{IFFT}\left\{\log X[k]\right\} = \text{IFFT}\left\{\log|X[k]| + j\,\angle X[k]\right\}$$Meaning of the quefrency axis:
- The unit of quefrency is time (seconds or sample counts)
- A peak at quefrency $= \tau$ → the spectrum has a periodic structure with spacing $= 1/\tau$ Hz
- Peak height ∝ strength of the periodic structure
Expand: Why take the log?
The cepstrum's design has a deep reason: homomorphic deconvolution.
Many signals can be modeled as the convolution of two components:
$$x[n] = s[n] * h[n] \quad \Longleftrightarrow \quad X[k] = S[k] \cdot H[k]$$Take the logarithm:
$$\log|X[k]| = \log|S[k]| + \log|H[k]|$$Convolution becomes multiplication in the frequency domain; after taking the log, it becomes addition. Then apply IFFT:
$$c_x[n] = c_s[n] + c_h[n]$$If $c_s$ and $c_h$ occupy different regions in the quefrency domain, liftering (windowing/filtering in the cepstral domain) can separate them.
Speech example: Speech $x = e * v$ (glottal excitation $e$ convolved with vocal tract impulse response $v$).
- Vocal tract $v$'s cepstrum is concentrated at low quefrency ($< 3$ ms) → spectral envelope (formants)
- Excitation $e$'s cepstrum has a peak at high quefrency ($= T_0 \approx 5$~$10$ ms) → fundamental frequency
Apply a low-time lifter to retain low quefrency → keeps only the vocal tract response → formants can be extracted.
Find peaks at high quefrency → fundamental frequency estimation. $\;\blacksquare$
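The speech example can be sketched on a synthetic harmonic signal (all parameters here — the 200 Hz fundamental, the 1/k harmonic rolloff, frame length — are assumed for illustration; a real recording would additionally need framing and voicing detection):

```python
import numpy as np

fs, N = 8000, 2048
t = np.arange(N) / fs
F0 = 200.0                                   # assumed fundamental (female-voice range)
# harmonic-rich synthetic "voiced" frame: harmonics up to 3.6 kHz, 1/k rolloff
x = sum(np.cos(2 * np.pi * k * F0 * t) / k for k in range(1, 19))

X = np.fft.rfft(x * np.hanning(N))
c = np.fft.irfft(np.log(np.abs(X) + 1e-12))  # real cepstrum (eps guards log(0))

# high-quefrency peak search: 2.5-15 ms, i.e. F0 between ~67 and 400 Hz;
# the low-quefrency region holds the spectral envelope and is skipped
lo, hi = int(0.0025 * fs), int(0.015 * fs)
n_peak = lo + np.argmax(c[lo:hi])
f0_est = fs / n_peak
print(f0_est)                                # expected near 200 Hz (quefrency 5 ms)
```

Applying a low-time lifter (keeping only c[:lo]) before an FFT would instead recover the smooth spectral envelope, i.e. the formant structure.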
How to Use: Three Steps
Step 1: FFT → magnitude → log → IFFT → real cepstrum
$c[n] = \text{IFFT}\{\log|X[k]|\}$. Note that $|X[k]|$ may have zero values → add a small constant $\varepsilon$ to avoid $\log(0)$: $\log(|X[k]| + \varepsilon)$.
Step 2: Search for peaks on the quefrency axis
Ignore DC/low-frequency components near quefrency $\approx 0$. Search for peaks within a reasonable quefrency range (based on expected periodic structures).
Step 3: Peak location $\tau$ → corresponds to periodic structure spacing $1/\tau$ in the spectrum
Confirm results by checking for "rahmonics" of the peak ($2\tau, 3\tau, \ldots$).
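The three steps can be sketched on the echo-detection case that originally motivated the cepstrum (the delay, echo strength, and signal length are made-up values):

```python
import numpy as np

rng = np.random.default_rng(0)
fs, N, D, alpha = 8000, 4096, 100, 0.5        # echo delayed by D samples
x = rng.standard_normal(N)
y = x.copy()
y[D:] += alpha * x[:-D]                       # y[n] = x[n] + alpha * x[n-D]

# step 1: FFT -> magnitude -> log -> IFFT = real cepstrum (eps guards log(0))
c = np.fft.irfft(np.log(np.abs(np.fft.rfft(y)) + 1e-12))

# step 2: peak search, skipping the low-quefrency region near 0
n_peak = 20 + np.argmax(c[20:N // 2])
print(n_peak)                                 # expected at D -> quefrency 12.5 ms

# step 3: the rahmonic at 2D confirms the result (smaller, and negative-going,
# since log(1 + a e^{-jwD}) = a e^{-jwD} - (a^2/2) e^{-j2wD} + ...)
```

With $\alpha = 0.5$ the cepstral peak height is about $\alpha/2 = 0.25$, far above the noise-induced cepstral floor, so the delay estimate is robust.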
Concrete example: Gearbox fault detection
- Gearbox: number of teeth $z = 20$, rotational speed $N_r = 1800\,\text{RPM} = 30\,\text{Hz}$
- Gear mesh frequency GMF = $z \times f_r = 20 \times 30 = 600\,\text{Hz}$
- If the gear has a localized defect → the spectrum shows equi-spaced sidebands around 600 Hz: $\ldots, 540, 570, 600, 630, 660, \ldots$ Hz
- Sideband spacing = $30\,\text{Hz}$ (= rotational frequency)
- The cepstrum shows a clear peak at quefrency $= 1/30 = 33.3\,\text{ms}$
- Plus smaller peaks at $66.7\,\text{ms}$ (2X) and $100\,\text{ms}$ (3X) (rahmonics)
- Conclusion: the spectrum has a periodic structure with 30 Hz spacing → points to gear fault with rotational speed as the modulation frequency
Application Scenarios
- Speech pitch detection: The speech signal's spectrum has a series of harmonics $F_0, 2F_0, \ldots, nF_0$. The cepstrum shows a peak at quefrency $= 1/F_0$. Male $F_0 \approx 100\,\text{Hz}$ → peak at 10 ms; female $F_0 \approx 200\,\text{Hz}$ → peak at 5 ms. This is one of the classic pitch detection methods (alongside autocorrelation), widely used in speech coding (G.729) and music analysis. MFCC features are a direct descendant of cepstrum analysis.
- Gearbox fault sideband analysis: As in the example above. The cepstrum can automatically extract sideband spacing from complex spectra without manually identifying spectral peak patterns. Industrial software such as B&K PULSE and SKF @ptitude have built-in cepstrum analysis. ISO 13373-9 specifically standardizes cepstrum application in gearbox diagnostics.
- Echo detection and removal: If the signal $y[n] = x[n] + \alpha x[n-D]$ (original signal plus an echo delayed by $D$), the spectrum $|Y| = |X|\cdot|1 + \alpha e^{-j\omega D}|$. $\log|Y| = \log|X| + \log|1 + \alpha e^{-j\omega D}|$. The second term is periodic with period $2\pi/D$ → the cepstrum shows a peak at quefrency $= D$, precisely locating the echo delay. Used in audio post-production and telephone echo cancellation.
- Mechanical fault transmission path isolation: Vibration signal $X = S \cdot H$ (excitation source $S$ × transfer path $H$). In the cepstrum, $c_x = c_s + c_h$. The periodic features of the excitation source (e.g., bearing fault frequency) appear at specific quefrencies, while the transfer path's influence (structural resonance envelope) is concentrated at low quefrency. Liftering can separate them, making fault signatures clearer.
Pitfalls and Limitations
- $\log(0)$ problem: If $|X[k]| = 0$ (spectrum has zero values), $\log(0) = -\infty$. In practice, a small constant must be added: $\log(|X[k]| + \varepsilon)$, where $\varepsilon \approx 10^{-10}$ ~ $10^{-12}$. Too large an $\varepsilon$ will distort the cepstrum shape.
- Phase unwrapping for the complex cepstrum is very tricky: The complex cepstrum requires $\log X = \log|X| + j\angle X$, where $\angle X$ must be continuous (unwrapped phase). But discrete spectrum phase is only defined in $(-\pi, \pi]$, and unwrapping algorithms frequently fail in the presence of noise. In practice, usually only the real cepstrum (power cepstrum) is used, avoiding the phase problem.
- Frequency resolution vs. quefrency resolution trade-off: The cepstrum's quefrency resolution = $1/f_s$ (one sample interval). To resolve two peaks with very close quefrencies (i.e., two close spectral periodicities), very high frequency resolution (long FFT) is needed, which in turn requires long data.
- The real cepstrum is an even function: $c[n] = c[-n]$ (because after taking log the result is real, so the IFFT output is symmetric). Only the $n > 0$ portion needs to be examined.
- Not suitable for non-periodic spectral features: The cepstrum's strength is detecting "equi-spaced patterns in the spectrum." If the fault spectrum is not equi-spaced sidebands but rather broadband elevation or a single peak shift, the cepstrum is less useful than directly examining the spectrum.
When Not to Use?
- No periodic patterns in the spectrum: The cepstrum's advantage lies in detecting "spectral periodicity." If your analysis target is a single frequency peak or broadband noise characteristics, examining the spectrum or PSD directly is more effective
- Need precise power/energy measurements: The logarithmic operation destroys the linear power relationship. If you need precise spectral energy values → use Welch PSD (Section 3.2)
- Analyzing bearing faults (not gearbox): Bearing faults are usually more directly and effectively analyzed with the envelope spectrum (Section 4.2). The cepstrum is better suited for gearboxes (since gear sidebands are typical equi-spaced patterns)
- Real-time speech pitch tracking: The cepstrum method requires the FFT of an entire speech segment, with significant delay. Real-time systems more commonly use autocorrelation or the YIN algorithm
References: [1] Bogert, Healy & Tukey, The Quefrency Alanysis of Time Series for Echoes: Cepstrum, Pseudo-Autocovariance, Cross-Cepstrum, and Saphe Cracking, Proc. Symposium on Time Series Analysis, 1963. [2] Oppenheim & Schafer, Discrete-Time Signal Processing, Ch.13 (Homomorphic Signal Processing). [3] Randall, Vibration-based Condition Monitoring: Industrial, Automotive and Aerospace Applications, Wiley, 2011. [4] Noll, Cepstrum Pitch Determination, JASA, 1967.
Interactive: Cepstrum Echo Detection
Original signal plus an echo delayed by D. The cepstrum shows a peak at quefrency=D.
5.1 Short-Time Fourier Transform (STFT)
The cornerstone of time-frequency analysis — sliding-window FFT
Why does this matter? Because the frequency content of most real-world signals changes over time. FFT only tells you "which frequencies are present" but not "when they appear." STFT is the most fundamental time-frequency analysis tool and the baseline for all advanced methods.
Previously: The analysis methods in Part IV assumed that the signal's frequency characteristics do not change over time. But real signals are usually non-stationary. STFT lets you see "when" and "what frequency" simultaneously.
Learning Objectives
- Understand the core idea of STFT: windowed segmentation + FFT, bringing time information into frequency analysis
- Master the selection logic for three key parameters: window length, overlap ratio, and NFFT
- Interpret typical patterns in Spectrograms
- Understand the fundamental limitation of Heisenberg's uncertainty principle on STFT resolution
One-Sentence Summary
STFT is "sliding-window FFT" — it lets you see how frequency content changes over time. Cut a long signal into many short segments, perform FFT on each segment, then arrange the results along the time axis to obtain a Spectrogram.
Pain Point: FFT Loses "Time"
Standard FFT tells you "which frequency components are in the signal," but tells you nothing about when those components appear. For time-varying signals, this is a fatal flaw:
- Speech: When transitioning from vowel /a/ to /i/, the formant frequencies change dramatically within 50 ms. FFT simply mixes all frequencies together
- Engine vibration: Accelerating from idle at 800 rpm to 6000 rpm, the dominant vibration frequency rises continuously. FFT only tells you "frequencies from 800 to 6000 rpm are all present"
- Seismic waves: P-waves (compressional, high-frequency, arrive first) and S-waves (shear, low-frequency, arrive later) require simultaneous analysis of arrival time and frequency characteristics
- Music: The onset/offset times, pitch changes, and vibrato of each note in a melody all require simultaneous time + frequency analysis
Fundamental problem: The Fourier transform basis function $e^{j\omega t}$ extends to $\pm\infty$ in time, so it inherently cannot provide time localization. We need a method to "localize" the signal to a finite time segment before analysis.
Origin
Dennis Gabor (1946) first proposed using a Gaussian-windowed STFT (which he called "logons") to analyze communication signals in his classic paper Theory of Communication. Gabor's core insight was that information in a signal lies not only in frequency but also in time — we need a joint time-frequency representation.
J. B. Allen (1977) systematized the theory of discrete STFT in his IEEE paper Short Term Spectral Analysis, Synthesis, and Modification by Discrete Fourier Transform, establishing a complete framework for window function selection, Overlap-Add reconstruction, and more, laying the foundation for modern digital speech processing.
Principle
Intuition: Imagine you are listening to a recording of a concert. Instead of listening to the entire song at once and then analyzing frequencies (that would be FFT), you take a "snapshot" at regular short intervals — recording which frequencies are present and how loud they are at each moment. Arrange all snapshots in a row, and you get a Spectrogram.
Steps:
- Choose a window function $w[n]$ (e.g., Hann window), length $L$ samples
- Slide the window to position $m$ in the signal, extract the local segment $x[n] \cdot w[n-m]$
- Perform FFT on this local segment → obtain the local spectrum at position $m$
- Slide the window by $H$ samples (Hop Size), repeat steps 2-3
- Arrange all local spectra along the time axis → Spectrogram
Discrete STFT Definition
$$\text{STFT}\{x\}[m, k] = \sum_{n=0}^{L-1} x[n + mH]\, w[n]\, e^{-j2\pi kn/N_{\text{FFT}}}$$$m$: time frame index, $k$: frequency bin index, $H$: Hop Size, $L$: window length, $N_{\text{FFT}}$: FFT length
Spectrogram
$$S[m,k] = |\text{STFT}\{x\}[m,k]|^2$$Squared magnitude → power spectral density as a function of time
Expand: Continuous STFT Definition and Properties Derivation
Continuous STFT:
$$\text{STFT}\{x\}(t,\omega) = \int_{-\infty}^{\infty} x(\tau)\, w(\tau - t)\, e^{-j\omega\tau}\, d\tau$$This can be understood as the inner product of $x(\tau)$ and $w(\tau - t)e^{j\omega\tau}$ — i.e., the "component magnitude" of the signal near time $t$ and frequency $\omega$.
Inverse transform (reconstruction):
$$x(t) = \frac{1}{2\pi \|w\|^2} \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \text{STFT}\{x\}(\tau,\omega)\, w(t-\tau)\, e^{j\omega t}\, d\omega\, d\tau$$The prerequisite is a window with nonzero energy ($\|w\|^2 \neq 0$), which makes the normalization well defined and guarantees perfect reconstruction.
Energy conservation (Parseval-like):
$$\int_{-\infty}^{\infty} |x(t)|^2\, dt = \frac{1}{2\pi \|w\|^2} \int\!\!\int |\text{STFT}\{x\}(t,\omega)|^2\, dt\, d\omega$$
Heisenberg Uncertainty Principle
The time resolution $\Delta t$ and frequency resolution $\Delta f$ of STFT cannot both be arbitrarily good simultaneously:
$$\Delta t \cdot \Delta f \geq \frac{1}{4\pi}$$
Longer window → smaller $\Delta f$ (clearer frequency view) → but larger $\Delta t$ (blurrier time view). And vice versa. This is not a technical limitation, but a fundamental mathematical limit. STFT uses the same window across the entire time-frequency plane → resolution is the same everywhere → this is a fixed rectangular tiling.
How to Use: Practical Parameter Selection Guide
Step 1: Determine the Frequency Resolution $\Delta f$ You Need
Frequency resolution is determined by window length: $\Delta f \approx f_s / L$ ($L$ = window length in samples).
Example: to achieve $\Delta f = 4$ Hz with sampling rate $f_s = 1000$ Hz → window length $L = f_s / \Delta f = 1000/4 = 250$ samples.
Step 2: Choose a Window Function
| Window Function | Main Lobe Width | Sidelobe Attenuation | Use Cases |
|---|---|---|---|
| Rectangular | Narrowest ($2f_s/L$) | -13 dB (worst) | Transient analysis, known signals without leakage |
| Hann | $4f_s/L$ | -31 dB | General-purpose first choice |
| Hamming | $4f_s/L$ | -43 dB | Speech analysis |
| Blackman-Harris | $8f_s/L$ | -92 dB | High dynamic range requirements |
| Gaussian ($\alpha$=2.5) | Adjustable | No sidelobes (theoretically) | Gabor analysis, minimum time-frequency area |
Step 3: Determine Overlap Ratio
Overlap ratio = $(L - H)/L \times 100\%$, where $H$ = Hop Size.
- 50% (Hann window): Minimum overlap satisfying the COLA (Constant Overlap-Add) condition, ensuring perfect reconstruction
- 75%: Smoother time axis, better time resolution — recommended for most cases
- 87.5%: For very fine time tracking (e.g., pitch tracking)
Time resolution: $\Delta t = H / f_s$ (time interval between frames).
Step 4: Choose NFFT
NFFT $\geq$ window length $L$, choose the next power of 2 (most efficient for FFT). If NFFT $>$ L → zero-padding → denser frequency axis (interpolation effect, but does not increase true frequency resolution).
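The distinction between zero-padding (denser axis) and true resolution (window length) can be demonstrated numerically; a sketch with two tones 20 Hz apart (the 440/460 Hz pair and all lengths are arbitrary example values):

```python
import numpy as np

fs = 1000
t = np.arange(1024) / fs
# two tones 20 Hz apart
x = np.sin(2 * np.pi * 440 * t) + np.sin(2 * np.pi * 460 * t)

def resolves_tones(L, nfft=1024):
    """True if an L-sample Hann-windowed FFT shows a clear valley at 450 Hz,
    i.e., the two tones appear as separate peaks."""
    X = np.abs(np.fft.rfft(x[:L] * np.hanning(L), nfft))
    f = np.fft.rfftfreq(nfft, 1 / fs)
    mag = lambda f0: X[np.argmin(np.abs(f - f0))]
    return mag(450) < 0.7 * min(mag(440), mag(460))

# L = 32  -> Delta f ~ 31 Hz: zero-padding to nfft = 1024 interpolates the
#            spectrum smoothly, but the tones still merge into one lobe
# L = 256 -> Delta f ~ 4 Hz: the same nfft now shows two separated peaks
```

Both cases use the same NFFT; only the window length changes the outcome.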
Scenario 1: Speech Analysis
| Parameter | Value | Rationale |
|---|---|---|
| $f_s$ | 16 kHz | Standard sampling rate for telephony/speech recognition |
| Window length $L$ | 512 samples = 32 ms | Covers 2-3 fundamental frequency periods (male $F_0 \approx 100$ Hz → 10 ms/period) |
| Overlap | 75% → $H$ = 128 samples | Smooth tracking of formant changes |
| NFFT | 512 | Already a power of 2 |
| $\Delta f$ | $16000/512 = 31.25$ Hz | Sufficient to distinguish adjacent formants |
| $\Delta t$ | $128/16000 = 8$ ms | Sufficient to track rapid speech changes |
Scenario 2: Mechanical Vibration Analysis
| Parameter | Value | Rationale |
|---|---|---|
| $f_s$ | 25.6 kHz | Common for vibration analysis (frequency range up to 10 kHz) |
| Window length $L$ | 4096 samples = 160 ms | Fine frequency resolution needed to distinguish closely spaced gear mesh frequencies |
| Overlap | 75% → $H$ = 1024 samples | Track speed changes |
| NFFT | 4096 | Already a power of 2 |
| $\Delta f$ | $25600/4096 = 6.25$ Hz | Can distinguish harmonics 10 Hz apart |
| $\Delta t$ | $1024/25600 = 40$ ms | Sufficient to track moderate speed changes |
Python Example: Computing a Spectrogram with scipy.signal.stft
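A minimal version of this example (the two-segment test signal and the parameter values are illustrative choices, not the only sensible ones):

```python
import numpy as np
from scipy import signal

fs = 1000.0
t = np.arange(int(2 * fs)) / fs
# test signal: 100 Hz during the first second, 300 Hz during the second
x = np.where(t < 1.0, np.sin(2 * np.pi * 100 * t), np.sin(2 * np.pi * 300 * t))

# 250-sample Hann window (Delta f = 4 Hz), ~75% overlap, NFFT = next power of 2
f, frames_t, Z = signal.stft(x, fs=fs, window='hann',
                             nperseg=250, noverlap=187, nfft=256)
S = np.abs(Z) ** 2                       # spectrogram

# dominant frequency in the early vs. late frames
f_early = f[S[:, frames_t < 0.5].mean(axis=1).argmax()]
f_late = f[S[:, frames_t > 1.5].mean(axis=1).argmax()]
```

`plt.pcolormesh(frames_t, f, 10*np.log10(S))` then renders the time-frequency image; the frequency jump at $t = 1$ s shows up as a step between the two horizontal lines.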
Interactive: Spectrogram and Window Length Trade-off
Choose different signals and window lengths to observe how the Spectrogram changes. Longer windows yield clearer frequency axes but blurrier time axes, and vice versa.
Application Scenarios
- Speech recognition front-end (Mel-Spectrogram): The input to modern ASR systems (e.g., Whisper) is STFT → Mel filter bank → log. Typical settings: 25 ms window, 10 ms hop, 80 Mel bins. This produces a 100-frame $\times$ 80-dimensional feature matrix per second
- Music Information Retrieval (MIR): Analyzing chord progressions, melody lines, and rhythmic structure in songs. Uses longer windows (2048-4096 samples @ 44.1 kHz = 46-93 ms) for sufficient frequency resolution to distinguish semitones
- Vibration order tracking: During engine acceleration, the diagonal lines on the STFT spectrogram represent the frequency trajectories of various orders. Engineers use this to find resonance points (sudden amplitude increases at certain RPMs)
- Seismic wave analysis: Spectrogram can distinguish P-waves (arrive first, high-frequency) from S-waves (arrive later, low-frequency), aiding in seismic wave characterization
- EEG event-related spectral analysis: Analyzing brainwave frequency changes after specific stimuli (e.g., alpha wave suppression, gamma wave enhancement), window length 0.5-2 seconds
Pitfalls and Limitations
- Window too long → time blurring: If the signal's frequency changes dramatically within 10 ms but you use a 100 ms window, the change gets "averaged out" and appears as a blurry blob on the Spectrogram
- Window too short → frequency blurring: A 32-sample window @ 1 kHz → $\Delta f = 31$ Hz, completely unable to distinguish two tones at 440 Hz and 460 Hz
- No "perfect" window length: The Heisenberg uncertainty principle guarantees that you cannot simultaneously achieve arbitrarily good time and frequency resolution. This is a physical law, not a technical shortcoming
- Spectrogram discards phase: $|STFT|^2$ throws away phase information. When phase is needed (e.g., signal reconstruction, Griffin-Lim algorithm), the full STFT must be retained
- Output size: A 10-second piece of music at $f_s = 44.1$ kHz, window length 2048, hop 512, NFFT 2048 → time frames $\approx 860$, frequency bins = 1025 → approximately 880,000 complex values. Memory requirements for long signals can be substantial
When Not to Use STFT? Alternatives
| Scenario | Problem | Alternative |
|---|---|---|
| Need multi-resolution (frequency detail at low freq, time detail at high freq) | STFT's fixed window cannot adapt | CWT (Wavelet Transform) → Section 5.4 |
| Analyzing very short transients (< a few cycles) | Frequency is meaningless when window is too short | WVD → Section 5.2 or Matching Pursuit |
| Nonlinear, non-stationary complex signals | Sinusoidal basis is not appropriate | EMD / HHT → Section 5.6 |
| Only need to track a few specific frequencies | Computing the full spectrum with STFT is wasteful | Goertzel algorithm or Chirp-Z Transform |
References: [1] Gabor, D., Theory of Communication, J. IEE, 93(26):429-457, 1946. [2] Allen, J.B., Short Term Spectral Analysis, Synthesis, and Modification by Discrete Fourier Transform, IEEE Trans. ASSP, 25(3):235-238, 1977. [3] Oppenheim & Schafer, Discrete-Time Signal Processing, 3rd ed., Ch.10. [4] Griffin & Lim, Signal Estimation from Modified Short-Time Fourier Transform, IEEE Trans. ASSP, 1984.
✅ Quick Check
Q1: For speech analysis using a 32 ms window (fs=16 kHz → 512 samples), what is the frequency resolution?
Show answer
Δf = fs/N = 16000/512 = 31.25 Hz. Sufficient to distinguish speech formants (spaced ~500-1000 Hz apart).
Q2: What happens if the STFT window is too short? Too long?
Show answer
Too short → frequency blurring (cannot resolve frequencies); too long → time blurring (cannot see rapid changes). There is no perfect window length.
5.2 Wigner-Ville Distribution (WVD)
Theoretically the highest-resolution time-frequency distribution — but at the cost of "ghost" artifacts
Why does this matter? Because the time-frequency resolution of STFT is limited by the uncertainty principle. WVD breaks through this limitation — at the cost of cross-terms. Understanding WVD is the foundation for understanding all quadratic time-frequency distributions.
Previously: The STFT in Section 5.1 is limited by the uncertainty principle: fixed window → fixed time-frequency resolution. WVD attempts to break through this limitation — at the cost of cross-terms.
Learning Objectives
- Understand the definition of WVD and its relationship to the local autocorrelation function
- Prove that WVD satisfies perfect Marginal Properties, free from Heisenberg limitations
- Understand the cause, location, and amplitude of Cross-Terms
- Master the smoothing strategies of Pseudo-WVD and Smoothed-WVD
One-Sentence Summary
WVD is theoretically the highest-resolution time-frequency distribution — but it produces "ghost artifacts" (cross-terms) where you don't expect them. Like a funhouse mirror: the main subject is seen very clearly, but strange phantoms appear nearby.
Pain Point: Can We Break the Heisenberg Limit?
STFT's time-frequency resolution is limited by the Heisenberg inequality $\Delta t \cdot \Delta f \geq 1/(4\pi)$ — the window length is fixed, and time and frequency resolution trade off against each other.
Is there a way to break through this limitation and simultaneously achieve perfect time resolution and frequency resolution? WVD's answer is: yes, but at a cost.
Origin
Eugene Wigner (1932) proposed this distribution in quantum mechanics to describe the "quasi-probability distribution" of quantum states in phase space (position-momentum). Wigner noticed it can take negative values — impossible in classical probability, reflecting the non-classical nature of quantum mechanics.
Jean Ville (1948) independently introduced the same mathematical form into signal processing for analyzing instantaneous frequency and power of signals. Hence it is called the Wigner-Ville Distribution.
Interestingly, Wigner later received the 1963 Nobel Prize in Physics — but not for WVD, rather for his contributions to the theory of atomic nuclei and elementary particles.
Principle
Intuition: At each time instant $t$, compute the signal's "local autocorrelation" (centered at $t$, looking at the correlation over $\tau/2$ before and after), then take the Fourier transform with respect to the lag $\tau$ → obtaining the "local spectrum" at that time instant.
Wigner-Ville Distribution
$$W_x(t,\omega) = \int_{-\infty}^{\infty} x\!\left(t + \frac{\tau}{2}\right)\, x^*\!\left(t - \frac{\tau}{2}\right)\, e^{-j\omega\tau}\, d\tau$$
Interpretation: $x(t+\tau/2) \cdot x^*(t-\tau/2)$ is the "instantaneous autocorrelation function" centered at $t$. Taking the FT with respect to $\tau$ → just like the Wiener-Khinchin theorem, the FT of the autocorrelation = power spectrum. Except here it is local and time-varying.
Perfect Marginal Properties
WVD satisfies the following remarkable marginal properties:
$$\int_{-\infty}^{\infty} W_x(t,\omega)\, \frac{d\omega}{2\pi} = |x(t)|^2, \qquad \int_{-\infty}^{\infty} W_x(t,\omega)\, dt = |X(\omega)|^2$$
This means WVD is an "ideal" energy distribution — the time and frequency projections exactly equal the instantaneous power and power spectrum, respectively. No Heisenberg inequality limitation.
Expand: Time Marginal Property Proof
Compute $\int W_x(t,\omega)\, d\omega/(2\pi)$:
$$\int \frac{d\omega}{2\pi} \int x\!\left(t+\frac{\tau}{2}\right) x^*\!\left(t-\frac{\tau}{2}\right) e^{-j\omega\tau}\, d\tau$$
Swap the order of integration:
$$= \int x\!\left(t+\frac{\tau}{2}\right) x^*\!\left(t-\frac{\tau}{2}\right) \underbrace{\left[\int \frac{e^{-j\omega\tau}}{2\pi}\, d\omega\right]}_{\delta(\tau)}\, d\tau$$
Using the sifting property of $\delta(\tau)$, set $\tau = 0$:
$$= x(t)\, x^*(t) = |x(t)|^2 \quad \blacksquare$$
Expand: Exact WVD Result for a Single-Component Chirp
Consider a linear chirp $x(t) = e^{j(\omega_0 t + \beta t^2/2)}$ (instantaneous frequency $\omega_i(t) = \omega_0 + \beta t$).
Substituting into the WVD:
$$x\!\left(t+\frac{\tau}{2}\right) x^*\!\left(t-\frac{\tau}{2}\right) = e^{j(\omega_0 + \beta t)\tau}$$
Therefore:
$$W_x(t,\omega) = \int e^{j(\omega_0+\beta t)\tau}\, e^{-j\omega\tau}\, d\tau = 2\pi\,\delta(\omega - \omega_0 - \beta t)$$
The WVD is precisely concentrated on the line of instantaneous frequency $\omega_i(t) = \omega_0 + \beta t$ in the time-frequency plane — perfect time-frequency localization, with no blurring whatsoever. $\blacksquare$
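This perfect-localization result can be checked numerically on a discrete chirp. A sketch computing a single time-slice of the discrete WVD (all signal parameters are arbitrary example values; frequencies are in cycles/sample):

```python
import numpy as np

N = 256
f0, beta = 0.1, 0.2 / N              # instantaneous frequency sweeps 0.1 -> 0.3
n = np.arange(N)
z = np.exp(1j * 2 * np.pi * (f0 * n + 0.5 * beta * n ** 2))  # analytic chirp

n0 = N // 2                          # evaluate the WVD at the middle instant
L = min(n0, N - 1 - n0)              # largest symmetric lag range
tau = np.arange(-L, L + 1)
K = z[n0 + tau] * np.conj(z[n0 - tau])   # instantaneous autocorrelation
W = np.abs(np.fft.fft(K))                # one time-slice of the discrete WVD

# using integer lags z[n0+tau] z*[n0-tau] doubles the frequency axis -> halve it
f_hat = np.argmax(W) / len(K) / 2
f_true = f0 + beta * n0                  # = 0.2 cycles/sample at the midpoint
```

The slice is sharply concentrated at the instantaneous frequency, matching the $\delta(\omega - \omega_0 - \beta t)$ result above.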
The Fatal Problem: Cross-Terms
For a multi-component signal $x = x_1 + x_2$:
$$W_x(t,\omega) = W_{x_1}(t,\omega) + W_{x_2}(t,\omega) + 2\,\mathrm{Re}\{W_{x_1,x_2}(t,\omega)\}$$
where the cross-term $W_{x_1,x_2}(t,\omega) = \int x_1(t+\tau/2)\, x_2^*(t-\tau/2)\, e^{-j\omega\tau}\, d\tau$.
Three alarming properties of cross-terms:
- Location: They appear at the midpoint of the time-frequency positions of $x_1$ and $x_2$ (average in both time and frequency)
- Amplitude: They can be as large as the auto-terms, or even larger
- Oscillation: Cross-terms oscillate rapidly (in the direction of the time-frequency difference between the two components), with frequency proportional to their time-frequency separation
For a signal with $N$ components, there are $N$ auto-terms but $N(N-1)/2$ cross-terms — when $N$ is large, cross-terms completely overwhelm the auto-terms!
How to Use
- Obtain the Analytic Signal: First apply the Hilbert transform to $x(t)$ to get the analytic signal $z(t) = x(t) + j\hat{x}(t)$. This eliminates cross-terms between positive and negative frequencies (which appear near zero frequency and are annoying)
- Compute the discrete WVD:
```matlab
% MATLAB / Octave conceptual code, O(N^2)
N  = length(z);
WV = zeros(N, N);
k  = 0:N-1;
for n = 1:N
    for tau = -(N-1):(N-1)
        n1 = n + round(tau/2);
        n2 = n - round(tau/2);
        if n1 >= 1 && n1 <= N && n2 >= 1 && n2 <= N
            % accumulate the lag product into every frequency bin
            WV(n, :) = WV(n, :) + z(n1)*conj(z(n2)) .* exp(-1j*2*pi*k*tau/N);
        end
    end
end
```
Note: Computational complexity is $O(N^2)$, much larger than STFT's $O(N \log N)$.
- Apply kernel smoothing to suppress cross-terms:
- Pseudo-WVD (PWVD): Window only in the $\tau$ direction → suppresses cross-term components far from $\tau=0$
- Smoothed Pseudo-WVD (SPWVD): Window in both $t$ and $\tau$ directions → stronger cross-term suppression, but greater resolution loss
- Visualization: Plot $W_x(t,\omega)$. Note that WVD can take negative values — this is not an error, but its non-classical property (similar to quasi-probability distributions in quantum mechanics)
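The workflow above — analytic signal, WVD slice, cross-term inspection — can be reproduced in a few lines of NumPy; a sketch with two analytic tones (the frequencies, in cycles/sample, are arbitrary example values):

```python
import numpy as np

N = 256
n = np.arange(N)
f1, f2 = 0.1, 0.3
# complex tones built directly; for a real signal, form z first via the
# Hilbert transform, e.g. scipy.signal.hilbert
z = np.exp(1j * 2 * np.pi * f1 * n) + np.exp(1j * 2 * np.pi * f2 * n)

def wvd_slice(z, n0):
    """|W[n0, k]|: one time-slice of the discrete WVD (frequency axis doubled
    by the integer-lag convention, so displayed freq = bin / len / 2)."""
    L = min(n0, len(z) - 1 - n0)
    tau = np.arange(-L, L + 1)
    return np.abs(np.fft.fft(z[n0 + tau] * np.conj(z[n0 - tau])))

W = wvd_slice(z, 100)                      # slice at time index n0 = 100
freqs = np.arange(len(W)) / len(W) / 2     # displayed frequency (cycles/sample)
mag = lambda f0: W[np.argmin(np.abs(freqs - f0))]

# auto-terms at f1 and f2, plus a cross-term at the midpoint (f1+f2)/2 = 0.2
# whose magnitude rivals the auto-terms, exactly as predicted above
```

Sweeping `n0` shows the midpoint "ghost" oscillating with time while the two auto-terms stay steady.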
Application Scenarios
- Precise time-frequency analysis of single-component chirp signals: For linear frequency-modulated (LFM) chirp pulses in radar returns, WVD achieves the theoretical limit of precision. For example: a radar chirp with 10 MHz bandwidth and 100 $\mu$s pulse width — WVD perfectly recovers its time-frequency slope $\beta = 10^{11}$ Hz/s
- Radar Ambiguity Function: The ambiguity function $A(\tau, \nu)$ is exactly the 2D Fourier transform of the WVD. Therefore, WVD properties directly correspond to the range-velocity resolution capability of the radar waveform
- Quantum optics / quantum information: The Wigner function remains the primary tool for describing quantum states of light fields (e.g., squeezed states, Fock states, cat states) to this day
Pitfalls and Limitations
- Nearly unusable directly for multi-component signals: 3 components yield 3 cross-terms, 10 components yield 45 cross-terms. After smoothing, the resolution advantage is greatly diminished
- Computational complexity $O(N^2)$: Much larger than STFT's $O(N \log N)$. At N = 10000, WVD requires approximately 750 times the computation time of STFT
- Can take negative values: Cannot be directly interpreted as an "energy distribution" — although the marginals are correct, local negative values make physical interpretation difficult
- Discretization difficulties: 2x oversampling is required to avoid aliasing in the discrete WVD
When Not to Use WVD? Alternatives
| Scenario | Problem | Alternative |
|---|---|---|
| Multi-component signals | Too many cross-terms | Smoothed Pseudo-WVD or Choi-Williams → 5.3 Cohen's Class |
| Need fast computation | $O(N^2)$ too slow | STFT ($O(N\log N)$) |
| Nonlinear non-stationary signals | Quadratic distribution assumptions not flexible enough | EMD / HHT → Section 5.6 |
| Need cross-term-free high resolution | WVD cross-terms cannot be eliminated | SST → Section 5.7 |
References: [1] Wigner, E., On the Quantum Correction for Thermodynamic Equilibrium, Phys. Rev., 40:749-759, 1932. [2] Ville, J., Theorie et Applications de la Notion de Signal Analytique, Cables et Transmission, 2(1):61-74, 1948. [3] Claasen, T.A.C.M. & Mecklenbräuker, W.F.G., The Wigner Distribution, Philips J. Res., 1980. [4] Cohen, L., Time-Frequency Analysis, Prentice Hall, 1995.
Interactive: WVD vs STFT Side-by-Side Comparison
Compare the STFT spectrogram with Pseudo-WVD. WVD has better time-frequency resolution, but dual-component signals exhibit cross-terms.
5.3 Cohen's Class: Generalized Quadratic Time-Frequency Distributions
Unified framework — all quadratic time-frequency distributions are smoothed versions of WVD
Why does this matter? Because STFT and WVD are just two extremes of time-frequency analysis. Cohen's Class provides a unified mathematical framework, letting you understand that all quadratic time-frequency distributions are different choices of "how to smooth the WVD."
Previously: The WVD in Section 5.2 has cross-term problems. The STFT in Section 5.1 has no cross-terms but poor resolution. Cohen's Class provides a unified framework for finding the optimal trade-off between the two.
Learning Objectives
- Understand the unified formula of Cohen's Class: the kernel function $\Phi(\theta,\tau)$ completely determines the distribution's properties
- Recognize the trade-off between auto-term preservation and cross-term suppression via the kernel function
- Compare specific distributions such as WVD, Spectrogram, and Choi-Williams
- Choose appropriate kernel functions based on signal characteristics
One-Sentence Summary
Cohen's Class is a unified framework: the STFT Spectrogram, WVD, and all time-frequency distributions in between — the only difference is "which kernel function smoothes the WVD." Choosing a kernel is like turning a radio dial: one end gives the highest resolution but with cross-terms (WVD), the other end gives no cross-terms but is blurry (Spectrogram).
Pain Point: Can We Compromise Between STFT and WVD?
We now face two extremes:
- STFT Spectrogram: No cross-terms ✓ but resolution limited by Heisenberg ✗
- WVD: Unlimited resolution ✓ but severe cross-terms for multi-component signals ✗
Is there a distribution between the two — retaining most of the resolution advantage while effectively suppressing cross-terms? Cohen's Class provides a systematic answer.
Origin
Leon Cohen (1966, 1989) proposed this unified theory. His core insight was that all "reasonable" quadratic time-frequency distributions (i.e., those satisfying time-shift and frequency-shift covariance) can be described by the same mathematical framework, differing only in the choice of a 2D kernel function $\Phi(\theta, \tau)$.
His 1989 review paper Time-Frequency Distributions — A Review published in Proceedings of the IEEE became one of the most cited papers in the time-frequency analysis field (over 5000 citations).
Principle
Intuition: Imagine the WVD as a high-resolution photo with lots of noise (cross-terms). Cohen's Class applies different blur filters to this photo — the stronger the filter, the less noise, but also the more blurred the details. Different kernel functions = different blur filters.
Cohen's Class Unified Formula
$$C_x(t,\omega) = \frac{1}{4\pi^2}\int\!\!\int\!\!\int e^{-j\theta t - j\tau\omega + j\theta u}\, \Phi(\theta,\tau)\, x\!\left(u+\frac{\tau}{2}\right) x^*\!\left(u-\frac{\tau}{2}\right)\, du\, d\tau\, d\theta$$
$\Phi(\theta,\tau)$: kernel function, completely determines the distribution's properties
Equivalently, it can be written as the 2D convolution of the WVD with the kernel function:
$$C_x(t,\omega) = \int\!\!\int \phi(t - t',\, \omega - \omega')\, W_x(t', \omega')\, dt'\, d\omega'$$
where $\phi(t,\omega)$ is the 2D Fourier transform of $\Phi(\theta,\tau)$
Implication: All Cohen's Class distributions are some form of 2D smoothing of the WVD. The kernel function determines in which directions and how much smoothing is applied in the time-frequency plane.
Expand: Relationship Between Kernel Properties and Marginal Conditions
Theorem: The necessary and sufficient condition for a Cohen's Class distribution $C_x$ to satisfy the marginal properties (i.e., $\int C_x\, d\omega = |x(t)|^2$ and $\int C_x\, dt = |X(\omega)|^2$) is:
$$\Phi(\theta, 0) = 1 \;\;\forall\theta \quad \text{and} \quad \Phi(0, \tau) = 1 \;\;\forall\tau$$
Proof (time marginal):
$$\int C_x(t,\omega)\, \frac{d\omega}{2\pi} = \frac{1}{2\pi}\int\!\!\int\!\!\int\!\!\int e^{-j\theta t - j\tau\omega + j\theta u}\, \Phi\, K_x(u,\tau)\, du\, d\tau\, d\theta\, d\omega$$
First integrate over $\omega$: $\int e^{-j\tau\omega}\, d\omega/(2\pi) = \delta(\tau)$, setting $\tau = 0$:
$$= \frac{1}{2\pi}\int\!\!\int e^{-j\theta t + j\theta u}\, \Phi(\theta, 0)\, |x(u)|^2\, du\, d\theta$$If $\Phi(\theta, 0) = 1$: $= \int |x(u)|^2\, \delta(t-u)\, du = |x(t)|^2 \quad \blacksquare$
Note: Most practical kernel functions do not simultaneously satisfy both marginal conditions. Sacrificing marginal properties is the cost of cross-term suppression.
Comparison of Common Kernel Functions
| Kernel $\Phi(\theta,\tau)$ | Distribution Name | Auto-Term Preservation | Cross-Term Suppression | Marginal Properties |
|---|---|---|---|---|
| $\Phi = 1$ | WVD | Perfect | None | Perfect |
| $\Phi = e^{-\theta^2\tau^2/\sigma}$ | Choi-Williams (CWD) | Good | Suppresses cross-terms far from origin | Satisfied |
| $\Phi = e^{-|\theta\tau|^\alpha}$ | Zhao-Atlas-Marks (ZAM) | Good | Suppresses off-axis cross-terms | Satisfied |
| $\Phi = h(\tau)$ (depends only on $\tau$) | Pseudo-WVD | Moderate | Windowing in $\tau$ direction | Time marginal ✓ Frequency marginal ✗ |
| $\Phi = A_h(\theta,\tau)$ (ambiguity function of window $h$) | Spectrogram | Most blurred | Fully suppressed | Not satisfied |
The Spectrogram is also a member of Cohen's Class! It can be shown that $|\text{STFT}|^2$ is equivalent to smoothing the signal's WVD with the WVD of the window function as the kernel:
$S_x(t,\omega) = \iint W_h(t'-t, \omega'-\omega)\, W_x(t',\omega')\, dt'\, d\omega'$
where $W_h$ is the WVD of the window function $h$. This is why the Spectrogram has no cross-terms — because it uses the "heaviest" smoothing.
How to Use
- Analyze your signal characteristics: Number of components? Time-frequency separation? SNR?
- Choose a kernel function:
- Single component → WVD ($\Phi = 1$), no cross-term issues
- Few components, well separated → Choi-Williams ($\sigma = 1$~$10$), usually the best starting point for trade-offs
- Many components or closely spaced → Spectrogram (fully suppress cross-terms, accept resolution loss)
- Tune parameters: Larger $\sigma$ in Choi-Williams → less smoothing → closer to WVD; smaller $\sigma$ → more smoothing → closer to Spectrogram
- Visualize and verify: Check for residual cross-terms (oscillations appearing where no components should exist)
Practical advice: First use a Spectrogram to quickly observe the signal's overall time-frequency structure and confirm the number and locations of components. Then use Choi-Williams to improve resolution, starting from $\sigma = 1$ and gradually increasing until cross-terms begin to appear.
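The two key kernel facts used above — the marginal conditions hold on the $\theta$ and $\tau$ axes, and $\sigma$ controls smoothing strength — can be checked numerically on the Choi-Williams kernel $\Phi(\theta,\tau)=e^{-\theta^2\tau^2/\sigma}$; a small sketch:

```python
import numpy as np

def cw_kernel(theta, tau, sigma):
    """Choi-Williams kernel Phi(theta, tau) = exp(-theta^2 tau^2 / sigma)."""
    return np.exp(-(theta ** 2) * (tau ** 2) / sigma)

theta = np.linspace(-5, 5, 101)

# marginal conditions hold for every sigma: Phi(theta, 0) = Phi(0, tau) = 1
on_axis = cw_kernel(theta, 0.0, 1.0)

# off the axes the kernel decays, attenuating cross-terms (which live away
# from the axes in the ambiguity plane); larger sigma -> weaker smoothing,
# i.e. the kernel approaches the WVD's Phi = 1
weak_smoothing = cw_kernel(2.0, 2.0, 100.0)   # close to 1 (WVD-like)
strong_smoothing = cw_kernel(2.0, 2.0, 1.0)   # close to 0 (heavy smoothing)
```

This is why starting from $\sigma = 1$ and increasing it, as suggested above, moves the distribution gradually from Spectrogram-like toward WVD-like behavior.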
Application Scenarios
- Engine acceleration analysis: The Choi-Williams distribution can clearly show how gear mesh frequencies rise linearly with RPM while suppressing cross-terms between different orders. Typical parameter $\sigma = 5$, analyzing 0-6000 RPM acceleration
- Speech transition analysis: Consonant-to-vowel transitions (e.g., /ba/ → /a/) involve rapid formant migration. CWD can more precisely localize the time and frequency trajectories of transitions than the Spectrogram
- Underwater acoustics: Multipath propagation in underwater channels causes multiple time-frequency components to overlap. Cohen's Class distributions help separate arrival times and Doppler shifts of different paths
Pitfalls and Limitations
- No "best" kernel function: Different signals require different kernels. Automatic kernel selection remains an open research problem
- Marginal properties are usually sacrificed: Most practical kernel functions do not satisfy the perfect marginal conditions, resulting in errors in time or frequency projections
- Computational cost: General Cohen's Class computation is $O(N^2)$ or more (triple integral), much slower than STFT
- Can take negative values: Like WVD, Cohen's Class distributions can generally take negative values (unless the kernel is designed to be positive definite, such as the Spectrogram)
When Not to Use? Alternatives
| Scenario | Problem | Alternative |
|---|---|---|
| Signal is multiple pure sinusoids + noise | Quadratic distributions are not the best tool | MUSIC / ESPRIT (parametric methods are more direct) |
| Need real-time processing | $O(N^2)$ too slow | STFT ($O(N\log N)$), trade resolution for speed |
| Highly nonlinear signals | Quadratic distribution assumptions (stationarity approximation) do not hold | EMD / HHT → Section 5.6 |
| Need multi-scale analysis | Cohen's Class has fixed resolution | CWT → Section 5.4 |
References: [1] Cohen, L., Generalized Phase-Space Distribution Functions, J. Math. Phys., 7(5):781-786, 1966. [2] Cohen, L., Time-Frequency Distributions — A Review, Proc. IEEE, 77(7):941-981, 1989. [3] Choi, H. & Williams, W., Improved Time-Frequency Representation of Multicomponent Signals, IEEE Trans. ASSP, 37(6):862-871, 1989. [4] Hlawatsch, F. & Boudreaux-Bartels, G.F., Linear and Quadratic Time-Frequency Signal Representations, IEEE SP Magazine, 1992.
5.4 Continuous Wavelet Transform (CWT)
Multi-resolution time-frequency analysis — frequency detail at low frequencies, time detail at high frequencies, automatically adaptive
Why does this matter? Because STFT has a fixed window length, but you often need to simultaneously resolve low frequencies and pinpoint high-frequency transients. The multi-resolution property of wavelets automatically resolves this contradiction — this is why wavelets are extensively used in seismology, finance, and neuroscience.
Previously: Cohen's Class in Section 5.3 still uses a fixed analysis window. CWT uses scalable wavelets — automatically using long windows for low frequencies and short windows for high frequencies — breaking through the fixed-window limitation.
🌉 From STFT to CWT: Why do we need wavelets?
STFT analyzes signals with a fixed-size window, which leads to a fundamental problem:
- Low-frequency signals change slowly and need long windows to resolve frequency (but you lose time localization)
- High-frequency signals change quickly and need short windows to localize events (but you lose frequency precision)
- STFT can only pick one fixed window length → both ends are unsatisfied
Concrete example: When analyzing music, bass notes (50–200 Hz, lasting 0.5 s) need a ~200 ms window, but a cymbal hit (5 kHz, lasting 5 ms) needs a ~5 ms window. STFT cannot do both at once.
CWT's solution: use a "stretchable window" — automatically shorter when analyzing high frequencies (good time precision), automatically longer when analyzing low frequencies (good frequency precision). This is multi-resolution analysis.
Below we'll see how CWT uses a single mother wavelet $\psi(t)$ together with two parameters — scale (window width) and translation (window position) — to achieve this adaptive analysis.
Learning Objectives
- Understand how CWT achieves multi-resolution analysis through scale dilation
- Master the physical meaning of the Admissibility Condition
- Compare the characteristics and use cases of Morlet and Mexican Hat wavelets
- Be able to convert between scale $a$ and frequency $f$
- Understand the boundary effects of the Cone of Influence
One-Sentence Summary
CWT is an evolution of STFT — long windows for low frequencies to resolve frequency detail, short windows for high frequencies to resolve time detail, automatically adaptive. Like a zoom lens on a telescope: automatically zooming in for distant objects (low frequencies) and zooming out for nearby objects (high frequencies).
Pain Point: The Fixed-Window Dilemma of STFT
STFT's window length is fixed. But many real signals require different resolutions at different frequencies:
- Seismic waves: Low-frequency components ($< 1$ Hz surface waves) require very long windows ($> 5$ seconds) to resolve frequencies, while high-frequency components ($> 10$ Hz body waves) need short windows ($< 100$ ms) to precisely locate arrival times. STFT cannot achieve both
- Music: The low notes C2 (65 Hz) and C#2 (69 Hz) differ by only 4 Hz, requiring long windows to distinguish; but percussive sounds in the high range need millisecond-level time localization
- Biomedical signals: The ECG QRS complex lasts 60-100 ms (high frequency), while the T wave lasts 200-400 ms (low frequency) — analysis requirements are completely different
Fundamental limitation of STFT: Fixed window = fixed rectangular time-frequency tiles. All frequencies use tiles of the same size, unable to simultaneously meet the different needs of low and high frequencies. CWT's solution: let the tile dimensions automatically adjust with frequency.
Origin
Jean Morlet (1982), a French geophysicist, first proposed the concept of wavelet analysis while studying seismic waves. He found that Fourier analysis worked poorly for seismic signals (which are transient and non-stationary), and conceived the idea of using "small waves" (wavelets) of different widths to match components of different frequencies.
Alex Grossmann & Jean Morlet (1984) jointly published the rigorous mathematical framework in SIAM, defining the continuous wavelet transform and the admissibility condition. Grossmann was a theoretical physicist who provided a solid mathematical foundation for Morlet's intuition.
Ingrid Daubechies (1988) further established a rigorous framework for wavelet theory, constructing orthogonal wavelet bases with compact support, opening the era of discrete wavelets (→ Section 5.5).
Principle
Intuition: Choose a "mother wavelet" $\psi(t)$ (a short oscillating waveform). To analyze low frequencies → stretch the mother wavelet (increase scale $a$) → better match for low frequencies, window length automatically increases. To analyze high frequencies → compress the mother wavelet (decrease scale $a$) → better match for high frequencies, window length automatically decreases.
Continuous Wavelet Transform (CWT)
$$W_x(a, b) = \frac{1}{\sqrt{|a|}}\int_{-\infty}^{\infty} x(t)\, \psi^*\!\left(\frac{t-b}{a}\right)\, dt$$
$a$: Scale, controls the wavelet width ($a$ large → stretched → low frequency)
$b$: Translation, controls the wavelet's time position
$\psi$: Mother Wavelet
$1/\sqrt{|a|}$: Energy normalization factor
Admissibility Condition
The mother wavelet must satisfy the admissibility condition:
$$C_\psi = \int_{-\infty}^{\infty} \frac{|\hat{\Psi}(\omega)|^2}{|\omega|}\, d\omega < \infty$$
This condition requires $\hat{\Psi}(0) = 0$, meaning the mother wavelet must have zero mean: $\int \psi(t)\, dt = 0$.
Physical intuition: the wavelet must be oscillatory (alternating positive and negative), with no DC component. This guarantees the invertibility of the CWT.
Expand: CWT Inverse Transform Formula Derivation
CWT inverse transform (reconstruction formula):
$$x(t) = \frac{1}{C_\psi}\int_0^{\infty}\int_{-\infty}^{\infty} W_x(a,b)\, \frac{1}{\sqrt{a}}\,\psi\!\left(\frac{t-b}{a}\right)\, db\, \frac{da}{a^2}$$
The key step in the proof uses Parseval's theorem and the admissibility condition, working in the frequency domain:
$$\hat{W}_x(a,\omega) = \sqrt{a}\, \hat{X}(\omega)\, \hat{\Psi}^*(a\omega)$$
Substituting into the inverse transform, when integrating over $a$ we use:
$$\int_0^{\infty} |\hat{\Psi}(a\omega)|^2\, \frac{da}{a} = C_\psi \quad \text{(independent of $\omega$)}$$
Therefore $x(t)$ is perfectly reconstructed. $\blacksquare$
Multi-Resolution Property: CWT vs STFT Time-Frequency Tiles
| STFT | CWT | |
|---|---|---|
| Tile shape | Fixed-size rectangles | Varies with scale: wide-tall at low freq, narrow-short at high freq |
| Low-freq behavior | $\Delta f$ fixed (may not be small enough) | $\Delta f$ small (long window → high frequency resolution) |
| High-freq behavior | $\Delta t$ fixed (may not be small enough) | $\Delta t$ small (short window → high time resolution) |
| Area $\Delta t \cdot \Delta f$ | Fixed (determined by window function) | Fixed (determined by mother wavelet) |
| Basis functions | Translation + modulation (windowed $e^{j\omega t}$) | Translation + dilation (scaled $\psi$) |
Key understanding: CWT does not "violate" the Heisenberg uncertainty principle — each tile's area $\Delta t \cdot \Delta f$ is still bounded. The difference is that the tile shape automatically adjusts with frequency, giving each frequency the most suitable time/frequency resolution ratio.
Common Mother Wavelets
Morlet Wavelet (Most Common for Time-Frequency Analysis)
$\omega_0 \approx 5$–$6$ (center frequency, typically $\omega_0 = 6$)
Complex exponential with a Gaussian envelope → approximately Gaussian in both time and frequency domains → minimum time-frequency area (approaching the Heisenberg limit). The first choice for time-frequency analysis.
Mexican Hat (DOG-2, Common for Singularity Detection)
Negative second derivative of a Gaussian. Real-valued wavelet → no complex phase information. Shaped like a Mexican hat. Suitable for detecting signal peaks and singularities.
Scale-Frequency Correspondence
$f_\psi$: center frequency of the mother wavelet (Morlet with $\omega_0 = 6$: $f_\psi \approx 0.955$ Hz); $\Delta t$: sampling interval
Note: The scale-frequency correspondence is not exact and depends on the spectral shape of the mother wavelet. Different mother wavelets have different $f_\psi$ values. Do not equate scale with frequency.
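The scale-frequency mapping $f = f_\psi / (a \cdot \Delta t)$ is a two-line computation. A minimal numpy sketch (the helper names `scale_to_freq`/`freq_to_scale` are ours, not a library API; the 360 Hz sampling rate is just an illustrative value):

```python
import numpy as np

# Scale <-> pseudo-frequency mapping: f = f_psi / (a * dt), a = f_psi / (f * dt).
# f_psi is the mother-wavelet center frequency; dt is the sampling interval.

def scale_to_freq(a, f_psi, dt):
    """Pseudo-frequency (Hz) corresponding to dimensionless scale a."""
    return f_psi / (np.asarray(a, dtype=float) * dt)

def freq_to_scale(f, f_psi, dt):
    """Dimensionless scale corresponding to frequency f (Hz)."""
    return f_psi / (np.asarray(f, dtype=float) * dt)

f_psi = 6.0 / (2.0 * np.pi)   # Morlet with omega_0 = 6  ->  f_psi ~ 0.9549
dt = 1.0 / 360.0              # e.g. a signal sampled at 360 Hz

scales = freq_to_scale([10.0, 40.0], f_psi, dt)   # ~ [34.4, 8.6]
```

Remember the mapping is only approximate: each scale really covers a frequency band whose width is set by the mother wavelet's spectral width.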
How to Use: Step-by-Step
Step 1: Choose a Mother Wavelet
- Time-frequency analysis → Morlet (complex-valued, has phase, best time-frequency resolution)
- Singularity / edge detection → Mexican Hat (real-valued, sensitive to peaks)
- Need consistency with DWT → Daubechies (compact support, for validating DWT results)
Step 2: Determine Scale Range $[a_{\min}, a_{\max}]$
- Corresponding frequency range $[f_{\min}, f_{\max}]$: $a_{\min} = f_\psi / (f_{\max} \cdot \Delta t)$, $a_{\max} = f_\psi / (f_{\min} \cdot \Delta t)$
- Scales are typically sampled geometrically (logarithmic spacing): $a_k = a_{\min} \cdot 2^{k \cdot dj}$, with $dj$ commonly $1/12$–$1/4$
Step 3: Compute CWT Coefficients
```matlab
% MATLAB example
scales = 2.^(0:0.1:7);                                % scale range
coefs = cwtft(x, 'wavelet','morl','scales',scales);
% or use the built-in function
[wt, f] = cwt(x, fs, 'amor');                         % 'amor' = analytic Morlet
```
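If the MATLAB wavelet toolbox is not available, the same computation can be sketched with numpy alone. This is a minimal frequency-domain analytic-Morlet CWT — the function name `morlet_cwt` and the normalization are our own illustrative choices, not a library API:

```python
import numpy as np

def morlet_cwt(x, scales, dt, omega0=6.0):
    """Minimal analytic-Morlet CWT, one FFT product per scale.

    scales are dimensionless (in samples); returns (coefs, pseudo-frequencies in Hz).
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    omega = 2.0 * np.pi * np.fft.fftfreq(n)        # rad/sample
    X = np.fft.fft(x)
    coefs = np.empty((len(scales), n), dtype=complex)
    for i, a in enumerate(scales):
        # Fourier transform of the scaled analytic Morlet: Gaussian bump at omega0/a
        psi_hat = (np.pi ** -0.25) * np.exp(-0.5 * (a * omega - omega0) ** 2)
        psi_hat *= (omega > 0)                     # analytic: positive frequencies only
        coefs[i] = np.fft.ifft(X * np.sqrt(a) * psi_hat)
    freqs = (omega0 / (2.0 * np.pi)) / (np.asarray(scales, dtype=float) * dt)
    return coefs, freqs

# Sanity check: the scalogram ridge of a 50 Hz tone should sit near 50 Hz
fs = 1000.0
t = np.arange(1024) / fs
x = np.cos(2.0 * np.pi * 50.0 * t)
scales = 2.0 ** np.arange(1.0, 7.0, 0.1)
coefs, freqs = morlet_cwt(x, scales, 1.0 / fs)
ridge_hz = freqs[np.argmax(np.abs(coefs[:, 512]))]
```

The circular (FFT) convolution means coefficients near the signal edges wrap around — the same boundary issue the Cone of Influence describes below.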
Step 4: Plot the Scalogram
$|W_x(a,b)|^2$ — energy distribution over scale (or corresponding frequency) and time. The y-axis is typically on a logarithmic scale.
Concrete Example: ECG QRS Complex Detection
| Parameter | Value | Rationale |
|---|---|---|
| Signal | ECG, $f_s = 360$ Hz | MIT-BIH database standard |
| Mother wavelet | Morlet ($\omega_0 = 6$) | Need both time localization and frequency analysis |
| Frequency range | 10-40 Hz | Main energy band of the QRS complex |
| Scale range | $a \approx 8.6$–$34.4$ | $a = f_\psi / (f \cdot \Delta t)$, $\Delta t = 1/360$ |
| Result | Scalogram peaks in the 10-40 Hz band | Precisely correspond to the R-peak position of each heartbeat |
Interactive: CWT Scalogram
Morlet CWT scalogram of a chirp signal. Note that in the low-frequency region, frequency resolution is high but time resolution is low (wide-thin tiles), while in the high-frequency region, time resolution is high but frequency resolution is low (narrow-thick tiles).
Application Scenarios
- Seismology: CWT is the standard tool for analyzing seismic waves. Low-frequency surface waves (0.01-0.1 Hz) use large scales for precise frequency measurement (determining crustal thickness), while high-frequency body waves (1-20 Hz) use small scales for precise arrival time localization (locating the epicenter). Morlet CWT is widely used at the IRIS seismic data center
- Financial time series: Morlet CWT is used to analyze multi-scale volatility of stock indices. Large scales (months to years) reveal business cycles, small scales (days to weeks) show short-term fluctuations. It can reveal how the dominant period changes across different time segments
- Neuroscience (brainwave analysis): Analyzing event-related spectral perturbations (ERSP) in EEG. CWT automatically provides appropriate time/frequency resolution across delta (1-4 Hz), theta (4-8 Hz), alpha (8-12 Hz), beta (13-30 Hz), and gamma (30-100 Hz) bands
- Mechanical fault transient detection: Periodic impact pulses from bearing damage (high-frequency transients) — CWT can precisely localize the time of each impact while analyzing its frequency characteristics (determining whether it's an inner ring, outer ring, or rolling element fault)
Pitfalls and Limitations
- Scale-frequency correspondence is imprecise: Different mother wavelets have different $f_\psi$, and the spectral width of the mother wavelet means each scale corresponds to a frequency range, not a single frequency
- Boundary effects — Cone of Influence (COI): At the beginning and end of the signal, large-scale wavelets "extend" beyond the signal boundary. Regions outside the COI are unreliable — typically marked as shaded regions on the scalogram
- Higher computational cost than STFT: CWT evaluates every scale-translation pair, producing $N \cdot N_{\text{scales}}$ coefficients; direct time-domain computation is up to $O(N^2 \cdot N_{\text{scales}})$, and even with frequency-domain (FFT) convolution per scale the cost is $O(N_{\text{scales}} \cdot N \log N)$
- Highly redundant: CWT produces far more coefficients than the original signal length (continuous $a$ and $b$), making it unsuitable for compression
- Reconstruction requires the admissibility condition: If the mother wavelet does not strictly satisfy the admissibility condition (e.g., Morlet wavelet when $\omega_0 < 5$), reconstruction will have errors
When Not to Use CWT? Alternatives
| Scenario | Problem | Alternative |
|---|---|---|
| Only need octave band decomposition (compression, denoising) | CWT is too redundant and slow | DWT ($O(N)$, no redundancy) → Section 5.5 |
| Need fixed frequency resolution | CWT's frequency resolution varies with scale | STFT (fixed $\Delta f$) → Section 5.1 |
| Need maximum time-frequency concentration | CWT is blurred due to mother wavelet width | SST (sharpened CWT) → Section 5.7 |
| Nonlinear signals, don't want preset basis | Wavelets are still predefined basis functions | EMD / HHT → Section 5.6 |
References: [1] Morlet, J. et al., Wave Propagation and Sampling Theory, Geophysics, 47(2):203-236, 1982. [2] Grossmann, A. & Morlet, J., Decomposition of Hardy Functions into Square Integrable Wavelets of Constant Shape, SIAM J. Math. Anal., 15:723-736, 1984. [3] Daubechies, I., Ten Lectures on Wavelets, SIAM, 1992. [4] Torrence, C. & Compo, G.P., A Practical Guide to Wavelet Analysis, Bull. Amer. Meteorol. Soc., 1998.
✅ Quick Check
Q1: What is the fundamental difference between CWT and STFT?
Show answer
STFT has a fixed window width. CWT's window width adaptively adjusts with frequency: long windows for low frequencies (high frequency resolution), short windows for high frequencies (high time resolution).
Q2: Why does the mother wavelet need to satisfy the zero-mean condition?
Show answer
This is the admissibility condition, which ensures the CWT is invertible. Physical meaning: the wavelet must be oscillatory (with both positive and negative values), and cannot be purely positive or purely negative.
5.5 Discrete Wavelet Transform (DWT)
Mallat's filter bank architecture — peeling frequency bands layer by layer like an onion
Why does this matter? Because CWT has high computational cost and high redundancy. DWT achieves the same multi-resolution analysis with $O(N)$ computation, and is the industry standard for JPEG 2000 image compression and ECG denoising.
Previously: CWT in Section 5.4 has high computational cost (continuous scales and translations). DWT uses filter banks to achieve an $O(N)$ discrete version, and is the industry standard for image compression and denoising.
Learning Objectives
- Understand that DWT is equivalent to iterated two-channel filter banks (Mallat's algorithm)
- Master the octave band decomposition structure in the frequency domain
- Understand the intuitive meaning and practical impact of Vanishing Moments
- Learn to select wavelet families and decomposition levels, and apply them to denoising and compression
- Recognize the shift-invariance problem of DWT and its solutions
One-Sentence Summary
DWT uses a recursive set of lowpass + highpass filters to decompose the signal layer by layer — peeling apart different frequency bands like an onion. Each layer halves the frequency range and data size, so the entire computation requires only $O(N)$ time.
Pain Point: CWT Is Too Redundant and Too Slow
CWT computes over continuous scales $a$ and continuous translations $b$ → producing a large number of redundant coefficients (far more than the original signal data). This is fine for theoretical analysis, but causes three problems in engineering applications:
- High computational cost: $O(N \cdot N_{\text{scales}} \cdot \log N)$, impractical for long signals
- High redundancy: Not suitable for compression (the goal is to represent the signal with the fewest coefficients)
- No orthogonality: Continuous-scale wavelets do not form an orthogonal basis, disadvantageous for mathematical analysis
Solution: Compute only at dyadic sampling scales $a = 2^j$ and translations $b = k \cdot 2^j$ → DWT.
Origin
Stephane Mallat (1989) made the key contribution connecting wavelet theory with engineering practice: he proved that DWT is equivalent to iterated two-channel filter banks, computable in $O(N)$ time. This algorithm, called Mallat's algorithm (or the Pyramid Algorithm), brought wavelets from theory to practice.
Ingrid Daubechies (1988) constructed orthogonal wavelet bases with compact support — meaning the filters have finite length and can be exactly implemented digitally. The Daubechies wavelet family she constructed (db1=Haar, db2, db3, ..., dbN) remains the most widely used wavelet family to this day.
Daubechies later received the Fudan-Zhongzhi Science Award in 2019 and the Wolf Prize in Mathematics in 2023, and is one of the most important founders of wavelet theory.
Principle: Mallat's Fast Wavelet Transform
Intuition: Each decomposition level passes the signal through two filters — a lowpass (retaining the low-frequency "rough outline") and a highpass (retaining the high-frequency "details"). The low-frequency part is then downsampled (since the bandwidth is halved, the sampling rate can be halved too), and the same process is repeated. Like peeling an onion: each layer reveals deeper structure.
Decomposition (Analysis)
$$c_{A}^{(j+1)}[k] = \sum_{n} h[n - 2k]\, c_{A}^{(j)}[n] \quad \text{(lowpass filtering + downsample by 2 → Approximation coefficients)}$$

$$c_{D}^{(j+1)}[k] = \sum_{n} g[n - 2k]\, c_{A}^{(j)}[n] \quad \text{(highpass filtering + downsample by 2 → Detail coefficients)}$$

$h[n]$: lowpass decomposition filter (corresponding to the scaling function); $g[n]$: highpass decomposition filter (corresponding to the wavelet function)
$g[n] = (-1)^n h[L-1-n]$ (QMF relationship)
Reconstruction (Synthesis)
$$c_{A}^{(j)}[n] = \sum_{k} \tilde{h}[n - 2k]\, c_{A}^{(j+1)}[k] + \sum_{k} \tilde{g}[n - 2k]\, c_{D}^{(j+1)}[k]$$

$\tilde{h}, \tilde{g}$: reconstruction filters. The Perfect Reconstruction (PR) condition guarantees $x = \text{IDWT}(\text{DWT}(x))$.
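The decomposition and reconstruction equations can be checked end-to-end with the simplest orthogonal pair, the Haar filters ($\tilde{h} = h$, $\tilde{g} = g$). A one-level hand-rolled sketch (not a library call), with the QMF relationship applied explicitly:

```python
import numpy as np

# Haar filters: h lowpass; highpass via QMF g[n] = (-1)^n h[L-1-n]
h = np.array([1.0, 1.0]) / np.sqrt(2.0)
g = np.array([(-1) ** n * h[len(h) - 1 - n] for n in range(len(h))])  # [1, -1]/sqrt(2)

def analyze(x):
    """cA[k] = sum_n h[n-2k] x[n]; cD likewise with g (x assumed even-length)."""
    x = np.asarray(x, dtype=float)
    cA = x[0::2] * h[0] + x[1::2] * h[1]
    cD = x[0::2] * g[0] + x[1::2] * g[1]
    return cA, cD

def synthesize(cA, cD):
    """x[n] = sum_k h[n-2k] cA[k] + sum_k g[n-2k] cD[k] (orthogonal case)."""
    x = np.empty(2 * len(cA))
    x[0::2] = cA * h[0] + cD * g[0]
    x[1::2] = cA * h[1] + cD * g[1]
    return x

x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
cA, cD = analyze(x)
x_rec = synthesize(cA, cD)     # perfect reconstruction: x_rec == x
```

Because the Haar basis is orthonormal, the transform also preserves energy: $\sum |cA|^2 + \sum |cD|^2 = \sum |x|^2$.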
Octave Decomposition in the Frequency Domain
Assuming $f_s$ is the sampling rate, effective frequency range $[0, f_s/2]$:
| Decomposition Level | Detail Coefficient Band | Approximation Coefficient Band | Number of Coefficients |
|---|---|---|---|
| Level 1 | $[f_s/4,\; f_s/2]$ | $[0,\; f_s/4]$ | $N/2$ + $N/2$ |
| Level 2 | $[f_s/8,\; f_s/4]$ | $[0,\; f_s/8]$ | $N/4$ + $N/4$ |
| Level 3 | $[f_s/16,\; f_s/8]$ | $[0,\; f_s/16]$ | $N/8$ + $N/8$ |
| Level $J$ | $[f_s/2^{J+1},\; f_s/2^J]$ | $[0,\; f_s/2^{J+1}]$ | $N/2^J$ + $N/2^J$ |
Total number of coefficients = $N$ (no redundancy). Computation = $N + N/2 + N/4 + \cdots < 2N = O(N)$.
Expand: Perfect Reconstruction (PR) Condition Derivation
In the $z$-domain, the analysis-synthesis system output of the two-channel filter bank is:
$$\hat{X}(z) = \frac{1}{2}\left[\tilde{H}(z)H(z) + \tilde{G}(z)G(z)\right]X(z) + \frac{1}{2}\left[\tilde{H}(z)H(-z) + \tilde{G}(z)G(-z)\right]X(-z)$$

Perfect reconstruction requires:
- Aliasing Cancellation: $\tilde{H}(z)H(-z) + \tilde{G}(z)G(-z) = 0$
- No Distortion: $\tilde{H}(z)H(z) + \tilde{G}(z)G(z) = 2z^{-d}$ (perfect pass-through with delay $d$)
For orthogonal wavelets ($\tilde{H} = H$, $\tilde{G} = G$), we only need $H(z)$ to satisfy:
$$|H(e^{j\omega})|^2 + |H(e^{j(\omega+\pi)})|^2 = 2$$

This is the key equation Daubechies used to construct orthogonal wavelets. $\blacksquare$
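The power-complementarity condition is easy to verify numerically. Below we hard-code the db2 lowpass filter (4 taps, the shortest Daubechies filter after Haar) and check the condition on a frequency grid; `dtft` is our own helper, not a library function:

```python
import numpy as np

# db2 lowpass decomposition filter (4 taps), normalized so that sum(h) = sqrt(2)
s3 = np.sqrt(3.0)
h = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4.0 * np.sqrt(2.0))

def dtft(h, w):
    """H(e^{jw}) = sum_n h[n] e^{-jwn}, evaluated at each angular frequency in w."""
    n = np.arange(len(h))
    return np.exp(-1j * np.outer(w, n)) @ h

w = np.linspace(0.0, np.pi, 257)
power = np.abs(dtft(h, w)) ** 2 + np.abs(dtft(h, w + np.pi)) ** 2
# power is identically 2 (up to floating-point error) for any orthogonal Daubechies filter
```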
Intuition of Vanishing Moments
A wavelet $\psi(t)$ having $p$ vanishing moments means:
$$\int_{-\infty}^{\infty} t^k\, \psi(t)\, dt = 0, \quad k = 0, 1, \ldots, p-1$$
Intuitive explanation: $p$ vanishing moments = the wavelet is "blind to" polynomials of degree $p-1$.
- 1 vanishing moment (Haar): Blind to constants (degree 0 polynomials) → only captures the "difference from a constant"
- 2 vanishing moments (db2): Blind to constants and linear trends → only captures the "difference from a straight line"
- 4 vanishing moments (db4): Blind to polynomials of degree $\leq 3$ → only captures higher-order details
Practical impact: More vanishing moments → smaller wavelet coefficients in smooth regions (approaching zero) → higher compression efficiency (more coefficients can be discarded). But the cost is: longer filters (db$p$ filter length = $2p$) → more computational delay and more severe boundary effects.
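A quick demonstration of this "blindness": the db2 highpass filter (2 vanishing moments) annihilates any linear ramp, while Haar (1 vanishing moment) does not. A numpy sketch with hard-coded filter taps (our own construction, not a library call):

```python
import numpy as np

s3 = np.sqrt(3.0)
# db2 lowpass (4 taps, 2 vanishing moments); highpass via QMF g[n] = (-1)^n h[L-1-n]
h = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4.0 * np.sqrt(2.0))
g = np.array([(-1) ** n * h[len(h) - 1 - n] for n in range(len(h))])
g_haar = np.array([1.0, -1.0]) / np.sqrt(2.0)

ramp = 0.5 + 0.3 * np.arange(64)                       # a degree-1 polynomial
db2_detail = np.convolve(ramp, g, mode='valid')        # interior outputs only
haar_detail = np.convolve(ramp, g_haar, mode='valid')

db2_max = np.max(np.abs(db2_detail))    # ~ 0: db2 is "blind" to linear trends
haar_max = np.max(np.abs(haar_detail))  # clearly nonzero: Haar only kills constants
```

This is exactly why smooth signal regions yield near-zero detail coefficients under higher-order wavelets.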
How to Use: Step-by-Step
Step 1: Choose a Wavelet Family
| Wavelet | Filter Length | Vanishing Moments | Characteristics | Use Cases |
|---|---|---|---|---|
| Haar (db1) | 2 | 1 | Simplest, discontinuous | Binary signals, step detection |
| db4 | 8 | 4 | Compact support, asymmetric | General-purpose first choice |
| sym8 | 16 | 8 | Nearly symmetric | When near-linear phase is needed |
| coif3 | 18 | 6 (wavelet) + 6 (scaling) | Both wavelet and scaling function have vanishing moments | Numerical analysis, approximation |
| CDF 9/7 | 9+7 (biorthogonal) | 4+4 | Symmetric, floating-point | JPEG 2000 image compression |
Step 2: Choose the Decomposition Level $J$
$J$ = how many octave bands you need. Maximum $J_{\max} = \lfloor\log_2(N/L)\rfloor$ ($L$ = filter length).
Rule of thumb: $J = \lfloor\log_2(N)\rfloor - 1$ or determined by frequency band requirements.
Step 3: Decompose
```matlab
% MATLAB
[C, L] = wavedec(x, J, 'db4');
% C = [cA_J | cD_J | cD_{J-1} | ... | cD_1]
% L = coefficient length at each level
```

```python
# Python (PyWavelets)
import pywt
coeffs = pywt.wavedec(x, 'db4', level=J)
# coeffs = [cA_J, cD_J, cD_{J-1}, ..., cD_1]
```
Step 4: Process Coefficients (Depending on Purpose)
- Denoising: Apply thresholding to detail coefficients
- Soft Thresholding: $\hat{d} = \text{sign}(d) \cdot \max(|d| - \lambda, 0)$ — smooth, small bias
- Hard Thresholding: $\hat{d} = d \cdot \mathbf{1}(|d| > \lambda)$ — preserves large coefficients, but discontinuous
- Universal Threshold: $\lambda = \sigma \sqrt{2\ln N}$, where $\sigma$ is estimated from Level 1 detail coefficients via MAD: $\hat{\sigma} = \text{MAD}(cD_1) / 0.6745$
- Compression: Keep only the $K$ largest coefficients, set the rest to zero → reconstruct. Compression ratio = $N/K$
- Feature extraction: Compute energy at each level $E_j = \sum_k |cD_j[k]|^2$ as classification features
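The soft, hard, and universal thresholding rules above fit in a few lines of numpy (the function names are ours; PyWavelets ships an equivalent as `pywt.threshold`):

```python
import numpy as np

def soft_threshold(d, lam):
    """sign(d) * max(|d| - lam, 0): shrinks every coefficient toward zero."""
    return np.sign(d) * np.maximum(np.abs(d) - lam, 0.0)

def hard_threshold(d, lam):
    """Keep coefficients with |d| > lam untouched, zero the rest."""
    return d * (np.abs(d) > lam)

def universal_threshold(cD1, N):
    """lambda = sigma*sqrt(2 ln N); sigma via MAD of level-1 details.

    For zero-median detail coefficients, MAD reduces to median(|d|).
    """
    sigma = np.median(np.abs(cD1)) / 0.6745
    return sigma * np.sqrt(2.0 * np.log(N))

# Pure-noise "detail coefficients": the universal threshold suppresses nearly all of them
rng = np.random.default_rng(0)
cD1 = rng.normal(0.0, 0.5, 4096)
lam = universal_threshold(cD1, 4096)
survivors = np.count_nonzero(soft_threshold(cD1, lam))
```

This is the rationale behind the universal threshold: for Gaussian noise it sits just above the expected maximum noise coefficient, so pure noise is almost entirely zeroed.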
Step 5: Reconstruct
```matlab
% MATLAB
x_rec = waverec(C_modified, L, 'db4');
```

```python
# Python
x_rec = pywt.waverec(coeffs_modified, 'db4')
```
Concrete Example: ECG Denoising
Signal: ECG, $f_s = 360$ Hz, contaminated with 60 Hz power line interference and high-frequency EMG noise
Method: db4, 4-level decomposition
| Level | Band (Hz) | Corresponding Component | Processing |
|---|---|---|---|
| cD1 | 90-180 | High-frequency EMG noise | Soft threshold (nearly all removed) |
| cD2 | 45-90 | 60 Hz power line + some EMG | Soft threshold |
| cD3 | 22.5-45 | Main energy of QRS complex | Preserve |
| cD4 | 11.25-22.5 | P wave, T wave | Preserve |
| cA4 | 0-11.25 | Baseline wander | Remove or preserve (depending on needs) |
Result: The reconstructed ECG clearly preserves QRS, P, and T waveforms, with high-frequency noise and baseline wander effectively removed.
Application Scenarios
- Image compression — JPEG 2000: Uses CDF 9/7 biorthogonal wavelets for 2D DWT on images, followed by quantization and entropy coding of wavelet coefficients. JPEG 2000 achieves 20-30% higher compression ratio than JPEG (which uses DCT) at the same quality. Smooth regions in images → wavelet coefficients approach zero (thanks to vanishing moments) → high compression ratio
- ECG/EEG denoising: As in the example above. Soft thresholding + universal threshold criterion is the standard method in medical signal processing
- Seismic data compression: Seismic exploration produces TB-scale data. DWT compression can reduce storage by 10-50x while preserving key geological structure features
- Edge detection: Edges in images = abrupt brightness changes → manifest as large values in DWT high-frequency (detail) coefficients. Haar or db2 wavelets are particularly suitable for detecting step-like edges
Pitfalls and Limitations
- Not shift-invariant: Shifting the same signal by one sample can completely change the DWT coefficients. The cause is the downsampling operation. This leads to: (a) denoising results may vary depending on the signal's starting point; (b) reconstructed signals may exhibit Gibbs-like ringing at edges
- Solution — Stationary Wavelet Transform (SWT): No downsampling → shift-invariant, but high redundancy ($J \times N$ coefficients vs. DWT's $N$)
- Choosing the wrong wavelet can introduce artifacts: For example, using the discontinuous Haar wavelet on smooth signals → blocky artifacts in reconstruction
- Only octave band decomposition: Frequency resolution decreases exponentially with level. Level 1 covers $[f_s/4, f_s/2]$, Level 5 covers $[f_s/64, f_s/32]$ — no "intermediate" bandwidth options
- Boundary handling: Signal boundaries need extension at each decomposition level (symmetric, periodic, or zero-padding), and different extension methods affect coefficients near boundaries
When Not to Use DWT? Alternatives
| Scenario | Problem | Alternative |
|---|---|---|
| Need a continuous time-frequency representation | DWT only has discrete octave bands | CWT → Section 5.4 |
| Need shift invariance | Downsampling destroys shift invariance | SWT (Stationary WT, no downsampling) or DTCWT (Dual-Tree CWT) |
| Need uniform frequency band splitting | Octave decomposition is not uniform | Wavelet Packet (can split bands arbitrarily) |
| Need data-driven decomposition | DWT basis is predefined | EMD → Section 5.6 or VMD |
References: [1] Mallat, S., A Theory for Multiresolution Signal Decomposition: The Wavelet Representation, IEEE Trans. PAMI, 11(7):674-693, 1989. [2] Daubechies, I., Orthonormal Bases of Compactly Supported Wavelets, Comm. Pure Appl. Math., 41(7):909-996, 1988. [3] Donoho, D.L. & Johnstone, I.M., Ideal Spatial Adaptation by Wavelet Shrinkage, Biometrika, 81(3):425-455, 1994. [4] Mallat, S., A Wavelet Tour of Signal Processing, 3rd ed., Academic Press, 2008.
Interactive: Haar DWT Decomposition
Decompose the signal layer by layer into approximation (low-frequency) and detail (high-frequency) coefficients using the Haar wavelet.
5.6 Empirical Mode Decomposition (EMD) / Hilbert-Huang Transform (HHT)
Data-driven adaptive decomposition — let the data tell you the basis
Why does this matter? Because Fourier and wavelet methods both use predefined basis functions, but some signals (ocean waves, seismic events, biological rhythms) don't look like sinusoids or known wavelets. EMD lets the data decide the basis — it is a unique tool for nonlinear, non-stationary signals.
Previously: All methods so far use predefined bases (sinusoids, wavelets). But what if the signal doesn't resemble any known basis? EMD lets the data decide the basis.
Learning Objectives
- Understand every step of the EMD Sifting Process
- Master the definition and physical meaning of IMF (Intrinsic Mode Function)
- Understand how the Hilbert-Huang Transform obtains a time-frequency representation from IMFs
- Recognize the Mode Mixing problem and the improvements of EEMD/CEEMDAN
One-Sentence Summary
EMD makes no assumptions about basis functions — it lets the data tell you what components to decompose into. No sinusoids, no wavelets — the decomposed components take whatever shape the signal has.
Pain Point: Predefined Bases Are Not Flexible Enough
Fourier analysis assumes the signal is composed of sinusoids; wavelet analysis uses a preselected mother wavelet as the basis. For nonlinear, non-stationary signals, these predefined basis functions may be fundamentally unsuitable:
- Ocean waves: Real ocean waves are not sinusoidal — crests are sharp and troughs are flat (Stokes waves). Fourier analysis produces many spurious harmonics (2x, 3x frequency...) that do not represent real physical components
- Biological signals: Heart rate variability (HRV) exhibits nonlinear, non-stationary characteristics. The LF/HF ratio in Fourier spectra is widely used, but its assumption (stationarity) is frequently violated
- Mechanical faults: The impact response from bearing damage is modulated by a nonlinear spring-damper system — the waveform is neither sinusoidal nor resembles any standard wavelet
Fundamental question: Real-world signals don't "owe" us a sinusoidal decomposition. Why not let the data decide how to decompose?
Origin
Norden E. Huang et al. (1998) proposed EMD and HHT at NASA Goddard Space Flight Center. This paper, published in Proceedings of the Royal Society A, is one of the most cited papers in signal processing over the past 25 years (over 15,000 citations).
Inspiration: Huang was an oceanographer, and his core motivation came from ocean wave analysis — ocean waves are highly nonlinear and non-stationary. The harmonics produced by Fourier analysis are mathematical artifacts that do not represent real physical processes. He wanted a method whose decomposed components directly reflect physical phenomena.
Huang once said: "EMD is like a surgeon's scalpel — you don't need it for healthy tissue, but when you need it, nothing else will do."
Principle: Sifting Process
Intuition: Like panning for gold — place the signal on a sieve and shake, the topmost oscillations (highest-frequency components) are sifted out first, then the next layer, and so on until only the slow trend remains.
Complete Steps of Sifting
- Find local extrema: Identify all local maxima and local minima of the signal $x(t)$
- Construct upper envelope: Apply cubic spline interpolation to all local maxima → obtain the upper envelope $e_{\max}(t)$
- Construct lower envelope: Apply cubic spline interpolation to all local minima → obtain the lower envelope $e_{\min}(t)$
- Compute mean envelope: $m(t) = \frac{e_{\max}(t) + e_{\min}(t)}{2}$
- Remove trend: $h(t) = x(t) - m(t)$
- Check IMF conditions:
- Condition 1: The difference between the number of extrema (maxima + minima) and zero crossings is $\leq 1$
- Condition 2: The mean envelope is approximately zero at all times
- If satisfied: $h(t)$ is the first IMF → $\text{IMF}_1(t) = h(t)$
- If not satisfied: Replace $x(t)$ with $h(t)$, return to Step 1 and continue sifting
- Remove the first IMF: $r_1(t) = x(t) - \text{IMF}_1(t)$ (residual)
- Use the residual as new input: $x(t) \leftarrow r_1(t)$, return to Step 1 to extract $\text{IMF}_2$
- Repeat until the residual $r_K(t)$ is monotonic (no more oscillations) or has $\leq 1$ extremum
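The sifting steps above translate almost line-for-line into code. A minimal sketch, assuming scipy is available for the cubic-spline envelopes; the function names (`sift_one_imf`, `emd`) are ours, and the stopping rules are simplified to the Cauchy-type SD criterion, so this is illustrative, not a production EMD:

```python
import numpy as np
from scipy.interpolate import CubicSpline

def local_extrema(x):
    """Indices of interior local maxima and minima (Step 1)."""
    d = np.diff(x)
    maxima = np.where((d[:-1] > 0) & (d[1:] <= 0))[0] + 1
    minima = np.where((d[:-1] < 0) & (d[1:] >= 0))[0] + 1
    return maxima, minima

def sift_one_imf(x, max_iter=50, sd_tol=0.2):
    """Steps 2-6: envelopes -> mean -> subtract, until the SD criterion is met."""
    h, t = x.copy(), np.arange(len(x))
    for _ in range(max_iter):
        mx, mn = local_extrema(h)
        if len(mx) < 4 or len(mn) < 4:          # too few extrema to build envelopes
            break
        upper = CubicSpline(mx, h[mx])(t)       # Step 2: upper envelope
        lower = CubicSpline(mn, h[mn])(t)       # Step 3: lower envelope
        m = 0.5 * (upper + lower)               # Step 4: mean envelope
        h_new = h - m                           # Step 5: remove trend
        sd = np.sum((h - h_new) ** 2) / (np.sum(h ** 2) + 1e-12)
        h = h_new
        if sd < sd_tol:                         # Step 6: Cauchy-type stop
            break
    return h

def emd(x, max_imfs=8):
    """Steps 7-9: peel off IMFs until the residual is (nearly) monotonic."""
    imfs, r = [], np.asarray(x, dtype=float).copy()
    for _ in range(max_imfs):
        mx, mn = local_extrema(r)
        if len(mx) + len(mn) <= 2:
            break
        imf = sift_one_imf(r)
        imfs.append(imf)
        r = r - imf
    return imfs, r

# Two well-separated tones: IMF1 should pick up the fast one
t = np.linspace(0.0, 1.0, 1000)
x = np.sin(2 * np.pi * 40 * t) + np.sin(2 * np.pi * 5 * t)
imfs, resid = emd(x)
recon = np.sum(imfs, axis=0) + resid    # exact by construction: IMFs + residual = x
```

Note the boundary handling here is deliberately naive (splines are simply extrapolated), which is exactly the end-effect pitfall discussed later in this section.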
EMD Decomposition Result
$$x(t) = \sum_{i=1}^{K} \text{IMF}_i(t) + r_K(t)$$

$\text{IMF}_i$: the $i$-th Intrinsic Mode Function (ordered from high to low frequency); $r_K$: final residual (trend)
Intuition of IMF: Each IMF can be viewed as a "nearly narrowband" oscillatory component of the signal. Its amplitude and frequency can both slowly vary over time (something Fourier analysis cannot handle), but at each moment there is only one dominant frequency.
Expand: Details of Sifting Stopping Criteria
When does the sifting iteration stop (when do we declare $h$ as an IMF)? Common criteria include:
Cauchy-type criterion (Huang 1998):
$$\text{SD} = \frac{\sum_t |h_{k-1}(t) - h_k(t)|^2}{\sum_t h_{k-1}^2(t)} < \epsilon$$

$\epsilon$ is typically set to $0.2$–$0.3$. Drawback: purely based on numerical convergence, may lead to over-sifting.
Rilling's three criteria (2003):
- $|m(t)/a(t)| < \theta_1$ holds at least $(1-\alpha)$ fraction of time points ($a(t) = (e_{\max} - e_{\min})/2$ is the local amplitude)
- $|m(t)/a(t)| < \theta_2$ holds at all time points
- Typical values: $\theta_1 = 0.05$, $\theta_2 = 0.5$, $\alpha = 0.05$
This criterion is more robust, avoiding over-sifting that distorts the IMF.
Hilbert-Huang Transform (HHT)
After obtaining the IMFs, apply the Hilbert transform to each IMF to extract instantaneous frequency and amplitude:
- Apply the Hilbert transform to $\text{IMF}_i(t)$ → obtain the analytic signal $z_i(t) = \text{IMF}_i(t) + j\,\hat{H}[\text{IMF}_i](t)$
- Instantaneous amplitude: $a_i(t) = |z_i(t)|$
- Instantaneous phase: $\phi_i(t) = \arg[z_i(t)]$
- Instantaneous frequency: $\omega_i(t) = d\phi_i/dt$
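For a single IMF-like component, these four steps are a few lines with `scipy.signal.hilbert`. A sketch on a synthetic linear chirp, whose true instantaneous frequency is known in closed form:

```python
import numpy as np
from scipy.signal import hilbert

fs = 1000.0
t = np.arange(2000) / fs
# Linear chirp: phase phi(t) = 2*pi*(20 t + 15 t^2)  ->  f(t) = 20 + 30 t  Hz
imf = np.cos(2.0 * np.pi * (20.0 * t + 15.0 * t ** 2))

z = hilbert(imf)                                       # analytic signal IMF + j*H[IMF]
amp = np.abs(z)                                        # instantaneous amplitude a(t)
phi = np.unwrap(np.angle(z))                           # instantaneous phase phi(t)
inst_f = np.gradient(phi, 1.0 / fs) / (2.0 * np.pi)    # f(t) = (1/2pi) dphi/dt, in Hz
```

Estimates near the signal edges are unreliable (the discrete Hilbert transform has its own end effects), so judge the result in the interior of the record.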
Hilbert Spectrum
$$H(\omega, t) = \sum_{i=1}^{K} a_i(t)\, \delta(\omega - \omega_i(t))$$Each IMF contributes an energy value at each time instant to the position of its instantaneous frequency → time-frequency representation
Unlike STFT/CWT, the frequencies in the Hilbert spectrum are not on a fixed grid but are continuously varying curves over time. This is a more natural representation for non-stationary signals.
How to Use
Concrete Example: Ocean Wave Data Analysis
Signal: 20-minute wave height time series from an ocean site, $f_s = 2$ Hz (sampling interval 0.5 seconds)
EMD result: Decomposed into 8 IMFs + residual
| IMF | Typical Period | Physical Correspondence |
|---|---|---|
| IMF 1-2 | 1-3 seconds | Wind waves — short waves generated by local wind |
| IMF 3 | 3-5 seconds | Transition zone — wind waves transitioning to swell |
| IMF 4-6 | 5-15 seconds | Swell — long waves from distant storms |
| IMF 7-8 | 30 seconds to minutes | Long-period waves / infragravity waves |
| Residual $r$ | $> 10$ minutes | Tidal and mean water level changes |
Advantage: EMD automatically separates waves of different physical origins without needing predefined frequency boundaries. Each IMF's waveform reflects the actual wave shape (non-sinusoidal), and the Hilbert spectrum reveals time-varying characteristics of instantaneous frequency and amplitude.
Application Scenarios
- Ocean engineering (wave analysis): The original application domain of EMD. Analyzing typhoon waves, rogue waves, and wave-structure interactions. Recommended by the IEEE Oceanic Engineering Society for nonlinear wave analysis
- Structural health monitoring: The vibration response of bridges and buildings under earthquakes or strong winds is nonlinear and non-stationary. EMD can separate different vibrational modes and track natural frequency changes as damage progresses
- Biomedical signals (heart rate variability analysis): In HRV analysis, EMD can extract respiration-related components (high-frequency IMFs) and blood pressure regulation components (low-frequency IMFs) without assuming stationarity
- Financial time series: Trend extraction from stock indices. Low-frequency IMFs + residual = long-term trend; high-frequency IMFs = short-term fluctuations and noise. More adaptive than moving averages
Pitfalls and Limitations
- Mode Mixing: The most serious problem. Significantly different frequency components mix into a single IMF, or the same physical component is split across multiple IMFs.
Cause: Intermittent signals (a frequency component that appears and disappears) cause problems with envelope interpolation during the sifting process.
Solutions:
- EEMD (Ensemble EMD, Wu & Huang 2009): Add white noise → EMD → average multiple results. The noise helps "break" the mixing. Typical settings: noise amplitude = 0.1-0.2 times standard deviation, ensemble size = 100-500
- CEEMDAN (Complete EEMD with Adaptive Noise, Torres et al. 2011): Add noise to the residual at each step rather than to the original signal → cleaner decomposition, less residual noise
- No rigorous mathematical theory: Unlike Fourier which has completeness and Parseval's theorem, and unlike wavelets which have frame theory. The convergence, uniqueness, and stability of EMD have no rigorous mathematical proofs
- End effects: Cubic spline interpolation at signal boundaries requires extrapolation → envelopes may diverge or become unreasonable. Common solutions: mirror extension, extrapolating extrema points
- Non-unique results: Different sifting stopping criteria, different envelope interpolation methods, different boundary handling → results may differ. This is problematic for scientific research requiring reproducibility
- Computational cost: Each sifting iteration requires finding extrema, interpolation, and subtraction → overall computational cost is not small, especially EEMD which requires hundreds of repetitions
When Not to Use EMD? Alternatives
| Scenario | Problem | Alternative |
|---|---|---|
| Need rigorous mathematical framework and reproducibility | EMD lacks theoretical guarantees | CWT or DWT (have complete mathematical theory) |
| Need stable, deterministic decomposition | EMD results depend on parameters | VMD (Variational Mode Decomposition) — replaces sifting with optimization, producing unique and stable results |
| Signal is linear and stationary | EMD's advantages are not significant, and it is more complex | FFT / Welch (simple and direct) |
| Need efficient real-time processing | EMD (especially EEMD) is too slow | STFT or DWT |
VMD (Variational Mode Decomposition, Dragomiretskiy & Zosso 2014): Reformulates mode decomposition as a constrained variational optimization problem. Advantages: unique results, not affected by initialization, can specify the number of modes. Disadvantage: requires a preset mode count $K$ (but can be automatically selected using residual criteria). In many applications, VMD is replacing EMD.
References: [1] Huang, N.E. et al., The Empirical Mode Decomposition and the Hilbert Spectrum for Nonlinear and Non-Stationary Time Series Analysis, Proc. R. Soc. Lond. A, 454:903-995, 1998. [2] Wu, Z. & Huang, N.E., Ensemble Empirical Mode Decomposition, Adv. Adaptive Data Analysis, 1(1):1-41, 2009. [3] Torres, M.E. et al., A Complete Ensemble Empirical Mode Decomposition with Adaptive Noise, IEEE ICASSP, 2011. [4] Dragomiretskiy, K. & Zosso, D., Variational Mode Decomposition, IEEE Trans. Signal Process., 62(3):531-544, 2014.
Interactive: EMD Sifting Process Animation
Observe the EMD process of progressively "sifting" out each IMF: find extrema → envelopes → mean → subtract → converge.
5.7 Synchrosqueezing Transform (SST)
Sharpening CWT — combining wavelet robustness with near-WVD resolution
Why does this matter? Because CWT time-frequency plots are too blurry and WVD has cross-terms. SST combines the advantages of both — the stability of wavelets plus near-WVD sharpness. It is one of the most important advances in time-frequency analysis in the past decade.
Previously: EMD in Section 5.6 lacks mathematical theory and produces non-unique results. SST combines the stability of CWT with the sharpness of WVD, and is an important advance of the past decade.
Learning Objectives
- Understand the core idea of SST: using instantaneous frequency estimates to "squeeze" CWT energy to the correct location
- Recognize the historical context of Frequency Reassignment and SST's improvement
- Compare the resolution and cross-term characteristics of SST vs. CWT vs. WVD
- Understand SST's sensitivity to noise and the 2nd-order SST extension
One-Sentence Summary
SST "squeezes" the blurry CWT time-frequency plot into sharp lines — maintaining the robustness and cross-term-free nature of wavelets while achieving near-WVD time-frequency concentration. Like refocusing a blurry photograph.
Pain Point: CWT Is Blurry, WVD Has Ghost Artifacts
So far, we face a trilemma:
| Method | Resolution | Cross-Terms | Robustness |
|---|---|---|---|
| STFT / Spectrogram | Heisenberg limited ✗ | None ✓ | High ✓ |
| WVD | Perfect ✓ | Severe ✗ | Low ✗ |
| CWT Scalogram | Multi-resolution (but still blurry) | None ✓ | High ✓ |
Is there a way to get the best of both worlds — no cross-terms like CWT, yet sharp like WVD? SST gives a near-perfect answer.
Origin
Frequency Reassignment (Auger & Flandrin, 1995): The earliest "sharpening" idea. They observed that STFT/CWT energy is "smeared" across the width of the window/wavelet, and that each coefficient's instantaneous frequency can be used to relocate its energy to the correct position. The problem: the reassigned representation is no longer invertible.
Ingrid Daubechies, Jianfeng Lu & Hau-Tieng Wu (2011): Proposed SST in Applied and Computational Harmonic Analysis. Their key contribution was designing an invertible reassignment method — squeezing only in the frequency direction (scale direction), leaving the time direction unchanged. This guarantees invertibility while obtaining a sharp time-frequency representation.
Daubechies made foundational contributions to both wavelet theory (CWT/DWT, Sections 5.4-5.5) and SST — she is one of the most influential scholars in the field of time-frequency analysis.
Principle
Intuition: Imagine the CWT scalogram as a blurry photo — energy spreads like watercolor across a range of scales. What SST does is: for each coefficient, ask "what is your true frequency?" (using instantaneous frequency estimation), then "squeeze" the energy from the current scale to the corresponding true frequency. Like pushing each drop of watercolor back to where it should be.
Step 1: Compute the CWT
$$W_x(a,b) = \frac{1}{\sqrt{a}}\int_{-\infty}^{\infty} x(t)\,\psi^*\!\left(\frac{t-b}{a}\right)dt$$
Step 2: Estimate the Instantaneous Frequency for Each Coefficient
$$\omega(a,b) = \text{Im}\!\left[\frac{\partial_b W_x(a,b)}{W_x(a,b)}\right]$$
The time derivative can be obtained from a second wavelet transform: $\partial_b W_x(a,b) = -\tfrac{1}{a}\,W_x^{(\psi')}(a,b)$, where $W_x^{(\psi')}$ is the CWT computed using $\psi'(t)$ (the derivative of the mother wavelet)
Intuition: $\omega(a,b)$ is the "true frequency" perceived by the CWT coefficient at scale $a$ and time $b$. If the signal's true frequency at this location is $\omega_0$, then regardless of what value $a$ takes (as long as $W_x(a,b)$ is not too small), $\omega(a,b) \approx \omega_0$.
Step 3: Synchrosqueezing
Redistribute CWT energy from scale $a$ to instantaneous frequency $\omega(a,b)$:
$$T_x(\omega_l, b) = \sum_{a_k:\ |\omega(a_k,b)-\omega_l|\,\leq\,\Delta\omega/2} W_x(a_k, b)\; a_k^{-3/2}\,\Delta a_k$$
For each frequency bin $\omega_l$, sum up the CWT coefficients from all scales whose instantaneous frequency points to it (the $a_k^{-3/2}$ factor compensates for the CWT normalization)
Expand: Why Is SST Invertible While Reassignment Is Not?
Frequency Reassignment (Auger-Flandrin): Redistributes energy simultaneously in both time and frequency directions. This is a many-to-one mapping (multiple time-frequency points may map to the same point) → not invertible.
SST: Redistributes only in the frequency (scale) direction, keeping the time position $b$ unchanged. Therefore, at each fixed $b$, the squeezing operation is one-dimensional.
Invertibility Theorem (Daubechies et al. 2011): For a signal composed of $K$ components with separated instantaneous frequencies, the SST result $T_x(\omega, b)$ can be uniquely inverted back to the original signal:
$$x(t) = \text{Re}\left[\frac{1}{C_\psi}\int T_x(\omega, t)\, d\omega\right]$$
The prerequisite is that the instantaneous frequencies of components do not overlap at any time (separation condition). $\blacksquare$
Expand: SST Effect Analysis for a Single-Component Chirp
Consider $x(t) = A(t)\, e^{j\phi(t)}$ with instantaneous frequency $\omega_0(t) = \phi'(t)$.
CWT response at scale $a$: $W_x(a, b) \approx A(b)\sqrt{a}\, \hat{\Psi}^*(a\omega_0(b))\, e^{j\phi(b)}$
Instantaneous frequency estimate: $\omega(a, b) = \omega_0(b)$ (independent of $a$!)
Therefore SST squeezes energy from all scales to $\omega_0(b)$ → forming a sharp line on the frequency axis, perfectly tracking the instantaneous frequency.
CWT scalogram: Energy spreads in a band-like region around $\omega_0(b)$ (due to the mother wavelet's width).
SST: Energy concentrates into a line at $\omega_0(b)$.
The sharpening degree is equivalent to WVD, but without cross-terms. $\blacksquare$
SST vs CWT vs WVD: Visual Comparison
| Property | CWT Scalogram | WVD | SST |
|---|---|---|---|
| Single-component appearance | Blurry band | Sharp line | Sharp line |
| Multi-component cross-terms | None | Severe | None |
| Invertibility | Invertible | Invertible (up to a constant phase) | Invertible (under separation condition) |
| Can take negative values | No ($|W|^2 \geq 0$) | Yes | Yes (complex-valued) |
| Computation | $O(N \cdot J \log N)$ | $O(N^2)$ | $O(N \cdot J \log N)$ (same order as CWT) |
How to Use
- Choose a mother wavelet: Typically Morlet (analytic wavelet), because SST requires precise instantaneous frequency estimation, and Morlet's narrowband characteristics are most suitable
- Compute CWT: Same as Section 5.4
- Compute instantaneous frequency map: $\omega(a,b)$ — requires computing both $W_x$ and $W_x^{(\psi')}$ simultaneously
- Perform Synchrosqueezing: Redistribute CWT coefficients to the corresponding frequency bins
- Visualize: Plot $|T_x(\omega, b)|^2$
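The steps above fit into a short NumPy sketch. This is a minimal, illustrative 1st-order SST: the function name `sst_morlet`, the geometric scale grid, and the hard energy threshold `eps` are choices of this sketch rather than part of the algorithm specification, and normalization constants are omitted, so prefer a maintained implementation for real work.

```python
import numpy as np

def sst_morlet(x, fs, freqs, n_scales=64, w0=6.0, eps=1e-6):
    """Minimal 1st-order synchrosqueezing of an analytic-Morlet CWT (sketch)."""
    N = len(x)
    X = np.fft.fft(x)
    xi = 2 * np.pi * np.fft.fftfreq(N, d=1.0 / fs)        # FFT angular frequencies
    # Morlet peak frequency is w0 / (2*pi*a): pick scales spanning the requested band
    scales = w0 / (2 * np.pi * np.geomspace(freqs[-1], freqs[0], n_scales))
    W = np.empty((n_scales, N), dtype=complex)            # step 1: CWT
    dW = np.empty_like(W)                                 # d/db of the CWT
    for i, a in enumerate(scales):
        psi_hat = np.exp(-0.5 * (a * xi - w0) ** 2) * (xi > 0)   # analytic Morlet
        W[i] = np.fft.ifft(X * psi_hat)
        dW[i] = np.fft.ifft(X * psi_hat * 1j * xi)        # differentiate in frequency domain
    with np.errstate(divide="ignore", invalid="ignore"):
        inst_f = np.imag(dW / W) / (2 * np.pi)            # step 2: instantaneous frequency (Hz)
    mask = np.abs(W) > eps * np.abs(W).max()              # ignore low-energy coefficients
    T = np.zeros((len(freqs), N))
    idx = np.clip(np.searchsorted(freqs, np.nan_to_num(inst_f)), 0, len(freqs) - 1)
    for b in range(N):                                    # step 3: squeeze along frequency only
        for i in range(n_scales):
            if mask[i, b]:
                T[idx[i, b], b] += np.abs(W[i, b])
    return T
```

For a pure 50 Hz sine, the squeezed energy collapses onto the row nearest 50 Hz, while the underlying $|W|$ spreads over many scales.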
Application Scenarios
- Component separation of multi-component non-stationary signals: For example, two chirp signals crossing in the time-frequency plane — STFT sees a blurry blob, WVD sees cross-term interference, but SST can see two sharp crossing lines. Combined with Ridge Extraction algorithms, each component can be automatically separated and inverted for reconstruction
- Geology (seismic wave mode separation): Different modes of seismic surface waves (fundamental mode, higher-order modes) appear as different $f$-$v$ relationships on dispersion curves. SST's sharp time-frequency plot makes mode separation more precise, aiding in subsurface structure inversion
- Heart rate variability (HRV) analysis: SST can precisely track the time-varying characteristics of breathing frequency (which varies with activity, sleep stages, etc.), localizing instantaneous respiratory frequency more sharply than CWT
- Mechanical fault diagnosis: Gear mesh frequency tracking under variable-speed conditions. SST sharpens the blurry frequency trajectories in CWT into precise lines, facilitating RPM computation and anomaly detection
Pitfalls and Limitations
- Sensitive to noise: The instantaneous frequency estimate $\omega(a,b) = \text{Im}[\partial_b W / W]$ becomes unstable in regions where $|W|$ is small (low SNR). Noise causes $\omega$ estimates to jump erratically → SST results show scattered speckles. In practice, an energy threshold must be set to ignore regions where $|W|$ is too small
- Not suitable for rapid frequency jumps: SST assumes instantaneous frequency varies slowly locally. For sudden frequency jumps (e.g., FSK modulation), the sharpening effect of SST degrades at the jump points
- 2nd-order SST (Oberlin et al. 2015): An improved version. Uses second-order instantaneous frequency estimation (accounting for the rate of change of $\omega$) → better results for chirps and other signals with rapidly changing frequencies. But computational cost is higher and implementation is more complex
- Separation condition: When components' instantaneous frequencies coincide at certain times, SST cannot separate them (unlike WVD — which theoretically can, but with cross-terms)
- Slightly higher computation than CWT: Requires additionally computing $W_x^{(\psi')}$ (one CWT with the derivative wavelet) and instantaneous frequency mapping. Approximately 2-3 times that of CWT
When Not to Use SST? Alternatives
| Scenario | Problem | Alternative |
|---|---|---|
| Only need a rough time-frequency view | SST computation is much larger than STFT | STFT Spectrogram (much faster, usually sufficient) |
| Very low SNR (< 0 dB) | Instantaneous frequency estimation is unstable | CWT Scalogram (blurry but stable) or Multitaper Spectrogram |
| Sudden frequency jumps | 1st-order SST assumptions don't hold | 2nd-order SST or STFT (short windows better track rapid changes) |
| Nonlinear signals, don't want preset wavelets | SST still depends on mother wavelet choice | EMD / HHT → Section 5.6 |
| Need precise energy distribution | SST results are complex-valued, not positive energy | Reassigned Spectrogram (Auger-Flandrin) |
Part V: Time-Frequency Analysis Methods Overview
Positioning and use cases for the seven methods:
| Method | Core Property | Best Suited For | Biggest Limitation |
|---|---|---|---|
| 5.1 STFT | Fixed window, fast computation | First choice for general time-frequency analysis | Fixed resolution |
| 5.2 WVD | Highest resolution | Precise single-component analysis | Multi-component cross-terms |
| 5.3 Cohen's | Unified framework, tunable kernel | Theoretical analysis and comparison | High computational cost |
| 5.4 CWT | Multi-resolution | Multi-scale time-frequency analysis | Redundant, blurry |
| 5.5 DWT | $O(N)$, no redundancy | Compression, denoising, feature extraction | Only octave bands |
| 5.6 EMD | Data-driven, adaptive | Nonlinear non-stationary signals | Lacks theory, mode mixing |
| 5.7 SST | Sharpened CWT, no cross-terms | Multi-component non-stationary separation | Noise-sensitive |
Selection guide: Start with STFT for quick observation → if multi-resolution is needed, use CWT → if sharper results are needed, use SST → if cross-terms are a problem, avoid WVD → if the signal is highly nonlinear, try EMD → if compression/denoising is needed, use DWT.
References: [1] Daubechies, I., Lu, J. & Wu, H.-T., Synchrosqueezed Wavelet Transforms: An Empirical Mode Decomposition-like Tool, Appl. Comp. Harm. Anal., 30(2):243-261, 2011. [2] Auger, F. & Flandrin, P., Improving the Readability of Time-Frequency and Time-Scale Representations by the Reassignment Method, IEEE Trans. Signal Process., 43(5):1068-1089, 1995. [3] Oberlin, T., Meignen, S. & Perrier, V., Second-Order Synchrosqueezing Transform or Invertible Reassignment?, IEEE Trans. Signal Process., 63(5):1335-1344, 2015. [4] Thakur, G. et al., The Synchrosqueezing Algorithm for Time-Varying Spectral Analysis, Signal Processing, 93(5):1079-1094, 2013.
6.1 Advanced FFT Algorithms
Learning Objectives
- Distinguish the applicable scenarios for Split-Radix, Bluestein, and Goertzel
- Determine when Goertzel is more efficient than FFT and when it is not
- Use the Bluestein identity to understand how arbitrary-length DFT can be converted into convolution
Why does this matter? Because real-world data lengths are not always powers of 2, and sometimes you only need the result for a single frequency — computing a full FFT would be wasteful.
Previously: Part V established the theory of time-frequency analysis. Part VI returns to engineering practice — starting with advanced FFT implementation techniques.
- Goertzel Algorithm — Gerald Goertzel, 1958. Designed to compute only a single frequency bin of the DFT.
- Bluestein Algorithm (Chirp-Z Transform) — Leo Bluestein, 1970. Cleverly converts an arbitrary-length DFT into a convolution problem.
- Split-Radix FFT — Duhamel & Hollmann, 1984. Combines the advantages of Radix-2 and Radix-4, saving approximately 20% of complex multiplications.
Principles
Intuition: The Radix-2 FFT is a "general-purpose tool," but more efficient alternatives exist for special scenarios:
- Split-Radix: Simultaneously uses a length-N/2 DFT (even part) and two length-N/4 DFTs (odd part), requiring approximately 20% fewer multiplications than pure Radix-2.
- Bluestein: Works for any N, relying on an algebraic identity to convert the DFT into convolution.
- Goertzel: Computes only a single bin using a second-order recursion (IIR), with computational cost O(N).
Formulas:
Bluestein Identity:
$$kn = \tfrac{1}{2}\bigl[k^2 + n^2 - (k-n)^2\bigr]$$
Substituting into the DFT twiddle factor $W_N^{kn}$:
$$X[k] = W_N^{k^2/2} \sum_{n=0}^{N-1} \bigl(x[n]\,W_N^{n^2/2}\bigr)\,W_N^{-(k-n)^2/2}$$
Inside the parentheses, $x[n]$ is multiplied by a known sequence (chirp), and the outer operation is a convolution with another known sequence. Three FFTs (zero-padded to the next power of 2) can complete an arbitrary-length N DFT.
Goertzel Recursion:
$$s[n] = x[n] + 2\cos(2\pi k/N)\,s[n-1] - s[n-2], \qquad s[-1]=s[-2]=0$$
$$X[k] = W_N^{-k}\,\bigl(s[N-1] - W_N^{k}\,s[N-2]\bigr)$$
(The leading $W_N^{-k}$ has unit modulus; when only the power $|X[k]|^2$ is needed, as in tone detection, it can be dropped.)
Full Derivation: Bluestein Identity
DFT definition: $X[k] = \sum_{n=0}^{N-1} x[n]\,W_N^{kn}$, where $W_N = e^{-j2\pi/N}$.
Using the identity $kn = \tfrac{1}{2}[k^2 + n^2 - (k-n)^2]$ (verify by expanding the right side):
$$W_N^{kn} = W_N^{k^2/2}\,W_N^{n^2/2}\,W_N^{-(k-n)^2/2}$$
Substituting into the DFT:
$$X[k] = W_N^{k^2/2} \sum_{n=0}^{N-1} \bigl[x[n]\,W_N^{n^2/2}\bigr]\,W_N^{-(k-n)^2/2}$$
Let $a[n] = x[n]\,W_N^{n^2/2}$ and $b[n] = W_N^{-n^2/2}$. The summation term is the value of the convolution of $a$ and $b$ at point $k$. Zero-pad $a$ and $b$ to length $\geq 2N-1$ (next power of 2), and three FFTs can complete the convolution.
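As a numerical check, the whole recipe is a few lines of NumPy (the helper name `bluestein_dft` is ours); it reproduces `np.fft.fft` for an arbitrary length such as N = 300:

```python
import numpy as np

def bluestein_dft(x):
    """Arbitrary-length DFT via Bluestein's chirp identity and three power-of-2 FFTs."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    n = np.arange(N)
    chirp = np.exp(-1j * np.pi * n**2 / N)            # W_N^{n^2/2}
    a = x * chirp                                     # premultiply by the chirp
    M = 1 << int(np.ceil(np.log2(2 * N - 1)))         # FFT length >= 2N-1, power of 2
    b = np.zeros(M, dtype=complex)                    # b[n] = W_N^{-n^2/2}, n = -(N-1)..N-1
    b[:N] = np.conj(chirp)
    b[M - N + 1:] = np.conj(chirp[1:][::-1])          # negative indices wrap to the end
    conv = np.fft.ifft(np.fft.fft(a, M) * np.fft.fft(b))   # circular = linear here
    return chirp * conv[:N]                           # postmultiply by the chirp
```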
Full Derivation: Goertzel Algorithm
DFT bin k: $X[k] = \sum_{n=0}^{N-1} x[n]\,W_N^{kn}$.
Observe that $W_N^{-kN} = 1$, so:
$$X[k] = W_N^{-kN}\sum_{n=0}^{N-1} x[n]\,W_N^{kn} = \sum_{n=0}^{N-1} x[n]\,W_N^{-k(N-n)}$$
This is equivalent to passing $x[n]$ through a filter with system function $H(z)=\frac{1}{1-W_N^{-k}z^{-1}}$ and taking the output at $n=N$.
However, this is a first-order complex recursion. To avoid complex multiplications, multiply both numerator and denominator by $(1-W_N^{k}z^{-1})$:
$$H(z) = \frac{1 - W_N^{k}z^{-1}}{1 - 2\cos(2\pi k/N)z^{-1} + z^{-2}}$$
The denominator is a second-order recursion with real coefficients (requiring only real multiplications). After running the recursion over the N input samples, the numerator gives $s[N-1] - W_N^{k}s[N-2] = W_N^{k}X[k]$; one final complex multiplication by $W_N^{-k}$ (a unit-modulus phase that can be skipped when only $|X[k]|^2$ is needed) yields $X[k]$.
Total: N real multiplications + 1 complex multiplication = O(N) per bin.
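The derivation translates directly into code. A sketch (the helper name `goertzel` is ours; the final multiply by $W_N^{-k}$ restores the exact phase and can be dropped if only power is needed):

```python
import numpy as np

def goertzel(x, k):
    """Single DFT bin X[k] via the Goertzel second-order recursion."""
    N = len(x)
    coeff = 2.0 * np.cos(2.0 * np.pi * k / N)
    s1 = s2 = 0.0
    for sample in x:                      # one real multiplication per sample
        s0 = sample + coeff * s1 - s2
        s2, s1 = s1, s0
    Wk = np.exp(-2j * np.pi * k / N)      # W_N^k
    # the numerator step gives W_N^k * X[k]; undo the unit-modulus phase factor
    return (s1 - Wk * s2) * np.conj(Wk)
```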
How to Use
- N is a power of 2 → Use Radix-2 or Split-Radix FFT directly.
- N is not a power of 2 → Zero-pad to the next power of 2 (simplest), or use Bluestein (when an exact N is strictly required).
- Only K bins needed (K << N) → Use K Goertzel computations. More efficient than a full FFT when $K < \log_2 N$.
Concrete Example: DTMF Dial Tone Detection
DTMF (Dual-Tone Multi-Frequency) only needs to detect 8 frequencies: 697, 770, 852, 941, 1209, 1336, 1477, 1633 Hz.
# Sample rate 8000 Hz, 205 samples per segment (~25.6 ms)
# Full FFT (N=256): (N/2)·log₂N = 128 × 8 = 1024 complex multiplications ≈ 4096 real
# 8 Goertzel runs: 8 × 205 = 1640 real multiplications + 8 complex (final power step)
# → Goertzel costs roughly 60% less, and works directly on the 205-sample segment (no zero-padding)
from math import cos, pi

def dtmf_detect(segment, threshold, fs=8000):
    N = len(segment)
    detected = []
    for freq in [697, 770, 852, 941, 1209, 1336, 1477, 1633]:
        k = round(freq * N / fs)
        coeff = 2 * cos(2 * pi * k / N)
        s1 = s2 = 0.0
        for sample in segment:
            s0 = sample + coeff * s1 - s2
            s2, s1 = s1, s0
        power = s1*s1 + s2*s2 - coeff*s1*s2   # |X[k]|²; the phase factor drops out
        if power > threshold:
            detected.append(freq)
    return detected
Application Scenarios
- DTMF telephone dial tone detection: Telephone switches use Goertzel to detect 8 frequencies. 205 samples per segment @8kHz, processing ~39 segments per second.
- Power quality measurement: Smart meters only need to measure the 50Hz fundamental and a few harmonics (100, 150, 200... Hz). Goertzel saves >10x power compared to a full FFT, extending battery life.
- Non-standard-length data: Some communication protocol frame lengths are not powers of 2 (e.g., LTE SC-FDMA uses multiples of 12). Bluestein/CZT can handle these directly.
- Zero-padding to the next power of 2 is usually the simplest and most effective approach. Unless there are strict constraints on N (memory, latency), Bluestein is unnecessary.
- Goertzel's efficiency advantage disappears when $K > \log_2 N$; in that case, a full FFT is faster.
- Split-Radix is more complex to implement than Radix-2. Modern CPUs have SIMD-optimized FFT libraries (FFTW, MKL), making manual implementations difficult to beat.
- If you can freely choose the data length, simply use a power of 2 — no need for Bluestein.
- If you need the full spectrum (all N bins), Goertzel offers no advantage.
- Alternative: Modern FFT libraries (FFTW) have built-in Mixed-Radix algorithms that efficiently handle any length (as long as N has small prime factors). Bluestein is truly needed only when N is prime.
✅ Quick Check
Q1: DTMF detection needs only 8 frequencies. Compared to a full FFT (N=256), is Goertzel cheaper? By how much?
Show answer
Goertzel: 8 × 205 = 1640 real multiplications. Full FFT: (N/2)log₂N = 1024 complex multiplications ≈ 4096 real multiplications. Goertzel is roughly 2.5× cheaper here, and it runs directly on the 205-sample segment with no zero-padding to 256. Its advantage disappears once K grows beyond roughly log₂N to 2log₂N bins.
Q2: Data length N=300 (not a power of 2) — what is the simplest approach?
Show answer
Zero-pad to 512 (the next power of 2) and use a standard Radix-2 FFT. Much simpler than the Bluestein algorithm.
Interactive: Goertzel vs FFT Efficiency Comparison
Compare the computational cost of Goertzel (computing only K frequencies) versus a full FFT. Goertzel is more efficient when K < log2N; otherwise FFT is more efficient.
6.2 Numerical Precision
Learning Objectives
- Compute the SQNR of floating-point FFT and its relationship to N and precision
- Explain overflow issues in fixed-point FFT and two scaling strategies
- Choose floating-point/fixed-point precision based on SQNR requirements
Why does this matter? Because when implementing FFT on embedded systems and FPGAs, numerical precision directly determines system performance — if you do not understand fixed-point overflow and rounding errors, your spectral results will be garbage.
Previously: 6.1 introduced different FFT algorithms. But on real hardware (FPGAs, embedded systems), floating-point numbers are not true real numbers — numerical precision is an engineering problem you must face.
Principles
Intuition: FFT is a cascade of butterfly operations. Each butterfly involves multiplication and addition, and each step introduces rounding error. $\log_2 N$ butterfly stages = $\log_2 N$ rounds of error accumulation.
Floating-point FFT error bound:
$$\|X_{\text{computed}} - X_{\text{exact}}\|_{\text{rms}} \;\leq\; C\,\varepsilon_m\,\sqrt{\log_2 N}\;\|x\|_{\text{rms}}$$
where $\varepsilon_m$ is the machine epsilon: $\approx 1.19 \times 10^{-7}$ for float32, $\approx 2.22 \times 10^{-16}$ for float64.
The problem with fixed-point FFT — Overflow:
The butterfly operation $a + b$ may exceed the representable range of fixed-point numbers. Two solution strategies:
- Right-shift 1 bit per stage (Convergent Scaling): After each butterfly stage, all values are right-shifted by 1 bit (divided by 2). This guarantees no overflow, but loses 1 bit of precision per stage. After $\log_2 N$ stages, $\log_2 N$ bits are lost.
- Block Floating Point (BFP): Check the maximum value at each stage and scale only when necessary. Average loss is less than per-stage right-shifting.
Concrete numbers:
| Configuration | N | SQNR (dB) | Notes |
|---|---|---|---|
| Float64 | 1024 | ≈ 281 | Sufficient for virtually any application |
| Float32 | 1024 | ≈ 131 | Sufficient for most consumer-grade applications |
| Fixed 32-bit | 1024 | ≈ 132 | Mainstream FPGA choice |
| Fixed 16-bit (per-stage scaling) | 1024 | ≈ 72 | 10 butterfly stages lose 10 bits, leaving 6 effective bits |
| Fixed 16-bit (BFP) | 1024 | ≈ 84 | ~12dB better than fixed scaling |
Derivation: SQNR of Fixed-Point FFT with Per-Stage Scaling
Quantization noise power for B-bit fixed-point: $\sigma_q^2 = 2^{-2B}/12$; for B=16, $\sigma_q^2 \approx 1.9 \times 10^{-11}$.
Right-shifting by 1 bit per stage is equivalent to introducing one quantization. With $\log_2 N = 10$ stages, each stage introduces one independent quantization noise.
Total quantization noise power $\approx 10 \times \sigma_q^2$ (but note that scaling at each stage progressively reduces the effective bit width).
More precisely: after each right-shift, the effective bits decrease from B to B-1, ultimately leaving B - log₂N = 16 - 10 = 6 effective bits.
SQNR ≈ 6.02 × 6 + 1.76 + 10log₁₀(N/2) ≈ 36.1 + 1.76 + 27.1 ≈ 65-72 dB (depending on signal statistics).
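The floating-point bound can be checked empirically by running the same transform in float32 and float64 and comparing (this assumes NumPy keeps complex64 input in single precision, which its pocketfft backend does):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1024
x = rng.standard_normal(N)
X64 = np.fft.fft(x)                           # float64 reference ("exact")
X32 = np.fft.fft(x.astype(np.complex64))      # same transform in single precision
rel_err = np.linalg.norm(X32 - X64) / np.linalg.norm(X64)
sqnr_db = -20 * np.log10(rel_err)
# rel_err should be on the order of eps_32 * sqrt(log2 N) ~ 4e-7, i.e. SQNR near 130 dB
print(f"relative error = {rel_err:.2e}, SQNR = {sqnr_db:.0f} dB")
```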
How to Use
- Float64 (double): Scientific computing, offline analysis with no precision concerns. SQNR > 280dB.
- Float32 (single): GPU acceleration, most consumer electronics, sufficient for N ≤ 2^20. SQNR ≈ 130dB.
- Fixed 32-bit: FPGA radar/communications, equivalent to float32 in most cases.
- Fixed 16-bit: Low-power embedded systems. Must use BFP or per-stage scaling. SQNR 70-85dB.
Concrete Example: FPGA Radar FFT Design
// Specification: 1024-point FFT, 14-bit ADC input, SQNR > 75dB required
// Solution: Use 18-bit internal word width + Block Floating Point
// Check maximum value after each butterfly stage:
if (max_value > 0.5 * FULL_SCALE):
right_shift_all_by_1()
block_exponent += 1
// 10 butterfly stages, on average only 5-6 scalings needed (depends on signal)
// Effective bits: 18 - 6 = 12 bit → SQNR ≈ 6.02×12 + 1.76 ≈ 74 dB
// Conclusion: 18-bit BFP barely meets spec; recommend 20-bit for margin
Application Scenarios
- FPGA radar processor: Xilinx/AMD FPGA DSP48 slices natively support 18×25 bit multiplication. 1024-point FFT using 18-bit BFP, throughput 500MHz.
- Embedded audio DSP: Audio DSPs inside mobile phone chips (e.g., Qualcomm Hexagon) typically use 16-bit fixed-point processing. 256-point FFT with per-stage scaling, SQNR ≈ 80dB, sufficient for audio (human ear dynamic range ~96dB).
- 5G baseband processor: 4096-point FFT @30.72MHz sample rate. Uses 16-bit fixed-point + BFP, processing tens of thousands of OFDM symbols per second.
- Float32 with large N: When N > 2^20, accumulated errors can become significant. Scenarios such as astronomical observations and large seismic arrays should use float64.
- Forgetting to scale in fixed-point: The most common bug. After overflow, values "wrap around," producing completely wrong results that are hard to debug.
- Twiddle factor precision: In fixed-point FFT, the precision of the sin/cos lookup table also affects results. Typically 2-4 more bits than the data width are needed.
- Parseval's theorem check: Time-domain energy should equal frequency-domain energy. If the discrepancy exceeds expected precision, there is a precision problem.
- If your platform has a floating-point unit (FPU), use floating-point directly. Fixed-point is only worthwhile when there is no FPU or power consumption is extremely constrained.
- If precision requirements exceed 90dB, 16-bit fixed-point is insufficient — 24-bit or 32-bit is needed.
- Alternative: Floating-point IP cores are available on FPGAs (Xilinx FFT IP supports float32), at the cost of more resources.
✅ Quick Check
Q1: What is the approximate SQNR (in dB) of a 1024-point float32 FFT?
Show answer
float32 has ε_m ≈ 1.2×10⁻⁷, SQNR ≈ 20log₁₀(1/(ε_m√log₂1024)) ≈ 20log₁₀(1/(1.2×10⁻⁷×√10)) ≈ 131 dB.
Q2: In a fixed-point 16-bit FFT with 1-bit right-shift per stage, how many effective bits remain after 10 stages (1024-point)?
Show answer
16 - 10 = 6 effective bits, giving SQNR ≈ 6.02 × 6 + 1.76 ≈ 38 dB per sample, or ≈ 65 dB after adding the 10log₁₀(N/2) ≈ 27 dB FFT processing gain. This is usually insufficient; block floating point or 32-bit fixed-point is needed.
6.3 Overlap-Add / Overlap-Save (OLA/OLS)
Learning Objectives
- Derive the correctness condition for OLA and OLS (FFT length >= L+M-1)
- Compute the latency and computational cost of segmented convolution
- Choose between OLA and OLS based on implementation requirements
Why does this matter? Because real-time audio processing, streaming filtering, and voice call noise reduction all require "process as you receive" — OLA/OLS is the standard approach for applying FFT to infinitely long streams.
Previously: 6.2 addressed precision issues. But there is another implementation problem: real signals are streams (infinitely long), making it impossible to FFT everything at once. How do you process in segments yet get results identical to processing all at once?
Principles
Intuition: The length of a linear convolution = sum of the two sequence lengths - 1. After chopping a long sequence into segments, each segment's convolution result is longer than the original segment — the tail "overflows" into the next segment's range. OLA and OLS handle this overflow differently.
Overlap-Add (OLA)
- Split the input $x[n]$ into non-overlapping segments, each of length L.
- Zero-pad each segment to $N = L + M - 1$ (M = filter length).
- FFT convolution: $Y_i = \text{IFFT}\{\text{FFT}\{x_i, N\} \cdot H[k]\}$.
- Each segment's result has length N > L; the trailing M-1 points are overlap-added with the head of the next segment.
Overlap-Save (OLS)
- Split the input $x[n]$ into overlapping segments, each of length N, with adjacent segments overlapping by M-1 points.
- Perform length-N FFT convolution directly (no zero-padding needed).
- The first M-1 points of each segment's result are erroneous due to circular convolution — discard them.
- Keep the remaining L = N - M + 1 points, which are exactly identical to the linear convolution.
Derivation: Why Discarding the First M-1 Points in OLS Is Correct
Length-N FFT convolution computes circular convolution, but what we need is linear convolution.
The difference between circular and linear convolution only appears where the "tail wraps around to the head." Specifically:
Linear convolution $y_{\text{lin}}[n] = \sum_{m=0}^{M-1} h[m]\,x_i[n-m]$; when $n < M-1$, $x_i[n-m]$ may access samples before the current segment.
In circular convolution, $x_i[n-m]$ "wraps around" to the end of the segment (modulo N), producing incorrect results.
However, if we overlap each segment by M-1 points, the first M-1 samples of the current segment are the last M-1 samples of the previous segment. Circular convolution for $n \geq M-1$ does not need to access data outside the segment, and is therefore perfectly identical to linear convolution.
Therefore: discard the first M-1 points (the erroneous circular convolution part) and keep the remaining L = N - M + 1 points (the correct part).
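The discard rule can be verified against direct linear convolution. A minimal NumPy sketch of OLS (the helper name `overlap_save` is ours):

```python
import numpy as np

def overlap_save(x, h, Nfft=256):
    """Overlap-save: overlapping input blocks, discard the first M-1 outputs of each."""
    M = len(h)
    L = Nfft - M + 1                              # valid output samples per block
    H = np.fft.rfft(h, Nfft)                      # precomputed once
    xp = np.concatenate([np.zeros(M - 1), x])     # first block: prepend M-1 zeros
    out = []
    for start in range(0, len(x) + M - 1, L):
        block = xp[start:start + Nfft]
        if len(block) < Nfft:
            block = np.pad(block, (0, Nfft - len(block)))
        y = np.fft.irfft(np.fft.rfft(block) * H, Nfft)
        out.append(y[M - 1:])                     # drop the circularly-corrupted head
    return np.concatenate(out)[:len(x) + M - 1]
```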
How to Use
- Determine filter length M (length of the impulse response).
- Choose segment length L: Latency = L/fs seconds. Larger L → higher latency but better efficiency (FFT overhead is amortized).
- FFT length N = L + M - 1, rounded up to the next power of 2.
- Pre-compute $H[k] = \text{FFT}\{h[n], N\}$ (only needs to be done once).
- Per-segment processing:
- OLA: Take L points → zero-pad to N → FFT → multiply by H[k] → IFFT → overlap-add.
- OLS: Take N points (including M-1 overlap) → FFT → multiply by H[k] → IFFT → discard first M-1 points.
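The OLA branch of the recipe, as a minimal NumPy sketch (the helper name `overlap_add` is ours; `rfft`/`irfft` are used since the signal and filter are real):

```python
import numpy as np

def overlap_add(x, h, L=512):
    """Overlap-add: non-overlapping length-L input segments, overlap the output tails."""
    M = len(h)
    Nfft = 1 << int(np.ceil(np.log2(L + M - 1)))   # next power of 2 >= L+M-1
    H = np.fft.rfft(h, Nfft)                       # precomputed once
    y = np.zeros(len(x) + M - 1)
    for start in range(0, len(x), L):
        seg = x[start:start + L]
        yi = np.fft.irfft(np.fft.rfft(seg, Nfft) * H, Nfft)
        stop = min(start + Nfft, len(y))
        y[start:stop] += yi[:stop - start]         # trailing M-1 points overlap the next segment
    return y
```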
Concrete Example: Audio Reverb Effect
// Impulse response: concert hall IR, M = 4096 points (@48kHz = 85.3ms reverb tail)
// Segment length: L = 4096
// FFT length: N = L + M - 1 = 8191 → round to 8192 (2^13)
// Latency = L / fs = 4096 / 48000 = 85.3 ms
// Computational cost comparison:
//   Direct convolution: L × M = 4096 × 4096 = 16,777,216 multiplications/segment
//   FFT convolution (H[k] precomputed once): 1 FFT + 1 IFFT + N products
//     ≈ 2 × (N/2)log₂N + N = 2 × 4096 × 13 + 8192 ≈ 114,688 complex multiplications/segment
//   Speedup: 16.8M / 115K ≈ 146×
// To reduce latency:
//   L = 512 → latency = 10.7 ms, N = 512 + 4096 - 1 = 4607 → round to 8192
//   Same cost per segment (same N), but more segments per second
//   Segments per second = 48000 / 512 = 93.75
//   Transforms per second = 93.75 × 2 = 187.5 8192-point FFT/IFFTs
Application Scenarios
- Audio effects: Reverb (IRs up to several seconds = hundreds of thousands of points), EQ (FIR with hundreds to thousands of taps). DAW software like Pro Tools uses OLA for real-time multi-track audio processing.
- Real-time Acoustic Echo Cancellation (AEC): Room impulse response ~100ms = 1600 points @16kHz. One segment every 10ms, using OLS for fast convolution.
- Radar Pulse Compression: Correlating long pulses with matched filters, using OLA/OLS for real-time processing.
- First segment of OLS: Requires prepending M-1 zeros to the input (since there is no "previous segment tail" available).
- FFT length too short: If N < L + M - 1, circular convolution will produce errors — results will have aliasing. This is the most common bug.
- Latency-efficiency tradeoff: Smaller L → lower latency but poorer efficiency (FFT overhead is large). Music performance requires <10ms latency, constraining L to 256-512.
- Numerical precision: Very long IRs (several seconds) may require very large N, where float32 precision may be insufficient. Partitioned IR techniques can be used (small N for low latency in early parts, large N for high efficiency in later parts).
- OLS has a simpler implementation (no accumulation buffer needed) and more regular memory access patterns.
- OLA is conceptually more intuitive, and input segments do not need to overlap (saving a bit of memory).
- Both have exactly the same efficiency (same number and size of FFTs).
When OLA/OLS is not needed:
- Short filter (M < 64) → direct time-domain convolution may be faster (no FFT overhead).
- Non-real-time processing with sufficient memory → perform a single full-length FFT convolution.
- Alternative: Partitioned Convolution splits a long IR into multiple segments of different FFT sizes, balancing low latency and high efficiency.
Interactive: Overlap-Add Segmented Convolution
Convolution of a long signal (4096 points) with a short filter (64 points). Comparing direct convolution and OLA results — perfectly identical.
✅ Quick Check
Q1: An audio reverb impulse response is 4096 points, processed using OLA. What is the minimum FFT length?
Show answer
N >= L + M - 1. If segment length L=4096, then N >= 4096+4096-1=8191, round to 8192 (power of 2).
Q2: Which is simpler to implement, OLA or OLS? Why?
Show answer
OLS is simpler — no extra overlap-add step is needed; just discard the first M-1 points of each segment's IFFT result.
6.4 Filter Design
Learning Objectives
- Design an FIR low-pass filter using the window method (from specifications to h[n])
- Use the Kaiser formula to compute the required filter order and beta parameter
- Compare the pros and cons of FIR vs IIR and choose based on requirements
Why does this matter? Because the first step in almost every DSP system is filtering — noise removal, anti-aliasing, channel selection. Not knowing how to design filters means not knowing how to do DSP.
Previously: 6.3 solved the long-sequence convolution problem. Now we design the convolution kernel — the filter itself.
- Window Method: The most intuitive FIR design approach — directly truncate the ideal impulse response and apply a window.
- Kaiser Window: James Kaiser, 1974. Proposed empirical formulas to precisely control the relationship between transition bandwidth and stopband attenuation.
- Parks-McClellan Algorithm (Equiripple Design): James McClellan & Thomas Parks, 1972. An optimization design based on Chebyshev approximation, achieving the minimum maximum error for a given filter order.
Principles
Intuition: The frequency response of an ideal low-pass filter is a rectangular function (passband=1, stopband=0). Taking the inverse Fourier transform yields a sinc function — but sinc is infinitely long! We must truncate it. Truncation = multiplying by a rectangular window = convolving in the frequency domain with a sinc → Gibbs phenomenon (ripples at the edges). Using a better window can suppress the ripples.
Window method steps:
- Ideal impulse response: $h_{\text{ideal}}[n] = \frac{\sin(\omega_c n)}{\pi n}$ (sinc function)
- Truncate to 2M+1 points (M determines the transition bandwidth)
- Multiply by window function $w[n]$: $h[n] = h_{\text{ideal}}[n] \cdot w[n]$
Kaiser window design formulas:
Given stopband attenuation $A_s$ (dB) and transition bandwidth $\Delta\omega$ (rad/s):
$$M \approx \frac{A_s - 8}{2.285 \cdot \Delta\omega}$$
$$\beta = \begin{cases} 0.1102(A_s - 8.7) & A_s > 50 \\ 0.5842(A_s - 21)^{0.4} + 0.07886(A_s - 21) & 21 \leq A_s \leq 50 \\ 0 & A_s < 21 \end{cases}$$
Derivation: Origin of the Kaiser Window Design Formulas
The Kaiser window is an approximation of the DPSS (Discrete Prolate Spheroidal Sequence), using the zero-order modified Bessel function of the first kind:
$$w[n] = \frac{I_0\!\left(\beta\sqrt{1-(2n/M)^2}\right)}{I_0(\beta)}, \quad |n| \leq M/2$$
Through extensive numerical experiments, Kaiser discovered the above empirical formulas relating $\beta$ and $M$ to stopband attenuation $A_s$ and transition bandwidth $\Delta\omega$, with accuracy within 10%.
$\beta$ controls the window shape (larger → wider main lobe, lower sidelobes). $M$ controls the window length (longer → narrower main lobe → narrower transition band).
How to Use
Concrete Example: Design a low-pass FIR with $f_c = 100$Hz, $f_s = 1000$Hz, transition band 50Hz (passband to 100Hz, stopband starting at 150Hz), stopband attenuation 60dB.
// Step 1: Convert specifications to normalized frequency
ωp = 2π × 100 / 1000 = 0.2π (passband edge)
ωs = 2π × 150 / 1000 = 0.3π (stopband edge)
Δω = ωs - ωp = 0.1π ≈ 0.3142 rad
ωc = (ωp + ωs) / 2 = 0.25π (cutoff frequency, midpoint of passband and stopband)
As = 60 dB
// Step 2: Compute β
As > 50, so β = 0.1102 × (60 - 8.7) = 0.1102 × 51.3 = 5.653
// Step 3: Compute filter order M
M = (As - 8) / (2.285 × Δω) = 52 / (2.285 × 0.3142) = 52 / 0.718 ≈ 72.4
Take M = 73 (odd, ensuring Type I FIR)
// Step 4: Generate h[n]
for n in range(-36, 37): // n = -(M-1)/2 ... (M-1)/2 = -36 ... 36
h_ideal = sin(0.25π × n) / (π × n) // at n=0 take the limit = ωc/π = 0.25
w = kaiser(n, β, M)
h[n] = h_ideal × w
// Step 5: Verify
// Passband ripple < 0.01dB (Kaiser window at As=60dB yields nearly flat passband)
// Stopband attenuation ≈ 60dB ✓
// Transition bandwidth ≈ 50Hz ✓
// Delay = (M-1)/2 = 36 samples = 36ms (@1kHz)
FIR vs IIR Comparison
| Property | FIR (Finite Impulse Response) | IIR (Infinite Impulse Response) |
|---|---|---|
| Stability | Always stable (no feedback) | May be unstable (poles must be inside unit circle) |
| Phase | Can achieve perfect linear phase | Nonlinear phase (unless allpass compensation is used) |
| Order | Usually requires higher order | Much lower order for equivalent specs (~1/10) |
| Design | Window method / Parks-McClellan | Butterworth / Chebyshev / Elliptic |
| Use cases | When linear phase is needed (audio, communications) | Real-time control, low-latency requirements |
Python Example: Designing an FIR Low-Pass Filter with scipy.signal.firwin
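A minimal sketch of this design flow for the worked example above (fs = 1000 Hz, band edges 100/150 Hz, 60 dB stopband). `kaiserord` converts the ripple/width spec into a length and β, `firwin` builds the taps, and the final check reads the attenuation off the actual frequency response:

```python
import numpy as np
from scipy import signal

fs = 1000.0                                        # Hz
# Kaiser spec: 60 dB stopband, 50 Hz transition band (normalized to Nyquist)
numtaps, beta = signal.kaiserord(ripple=60.0, width=50.0 / (fs / 2))
numtaps |= 1                                       # force odd length -> Type I FIR
h = signal.firwin(numtaps, cutoff=125.0,           # cutoff = midpoint of 100/150 Hz
                  window=("kaiser", beta), fs=fs)

# Always verify the result: read the stopband attenuation off the response
w, H = signal.freqz(h, worN=8192, fs=fs)
stop_atten_db = -20 * np.log10(np.abs(H[w >= 150.0]).max())
print(numtaps, round(beta, 3), round(stop_atten_db, 1))
```

As the "empirical formula" caveat below warns, the achieved attenuation should always be checked against the spec like this rather than trusted blindly.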
Application Scenarios
- Audio EQ (Equalizer): A 32-band EQ uses 32 FIR bandpass filters, each ~128 taps, processed in real time with OLA.
- Communications baseband filtering: 5G NR channel filters use FIR, requiring passband flatness <0.1dB and stopband attenuation >50dB.
- Biomedical signal preprocessing: ECG 50/60Hz notch filter requires only a 2nd-order IIR (FIR would need hundreds of taps).
- Kaiser formula is empirical: Results may deviate by 5-10%; always verify the frequency response after design.
- Linear phase = group delay = M/2 samples: High-order FIR delay may be unacceptable (e.g., 1000 taps @1kHz = 500ms delay).
- Parks-McClellan may not converge: With extremely narrow transition bands or very high stopband attenuation, parameter adjustments are needed.
- Very narrow transition band + low latency required → use IIR (Butterworth/Chebyshev/Elliptic).
- Optimal design needed (minimum order to meet specs) → use Parks-McClellan instead of the window method.
- Alternative: Multirate Filtering — decimating first then filtering can dramatically reduce computation. CIC + FIR combination is the standard architecture for SDR.
📝 Worked Example
Design a low-pass FIR: passband edge 200Hz, stopband edge 300Hz, stopband attenuation 50dB, fs=2000Hz. (a) Normalized frequencies? (b) Kaiser beta? (c) Filter order? (d) Center value of h[n]?
Show solution
(a) ωp=200/1000·π=0.2π, ωs=300/1000·π=0.3π, Δω=0.1π
(b) As=50 → β = 0.5842(50−21)^0.4 + 0.07886(50−21) ≈ 2.25 + 2.29 = 4.53
(c) M ≈ (50−8)/(2.285×0.1π) = 42/0.718 ≈ 58.5 → round up to 59 taps (odd length → Type I)
(d) Center tap: h[(M−1)/2] = ωc/π = 0.25π/π = 0.25
✅ Quick Check
Q1: Designing a low-pass FIR with 60dB stopband attenuation and 50Hz transition band — how many taps are needed with a Kaiser window?
Show answer
M ≈ (A_s - 8)/(2.285·Δω) = (60-8)/(2.285·2π·50/fs). If fs=1kHz → Δω=0.1π → M≈(52)/(2.285·0.314)≈72 taps.
Q2: What is the main advantage of FIR over IIR, and of IIR over FIR?
Show answer
FIR is inherently stable (no risk of poles outside the unit circle) and can easily achieve exact linear phase (= pure delay, no waveform distortion). IIR, in turn, meets the same magnitude specs with a much lower order (= less computation and lower latency).
The Four Types of Linear-Phase FIR Filters
A linear-phase FIR must satisfy a symmetry or antisymmetry condition $h[n] = \pm h[N-1-n]$. Depending on the symmetry type and the parity of the length, they fall into exactly four classes, each with distinct restrictions.
| Type | Symmetry | Length N | Properties | Cannot Realize |
|---|---|---|---|---|
| Type I | Symmetric $h[n]=h[N-1-n]$ | Odd (N odd) | No restriction; most general | — |
| Type II | Symmetric | Even (N even) | $H(e^{j\pi})=0$ | High-pass, band-stop |
| Type III | Antisymmetric $h[n]=-h[N-1-n]$ | Odd | $H(e^{j0})=0$ and $H(e^{j\pi})=0$ | Low-pass, high-pass |
| Type IV | Antisymmetric | Even | $H(e^{j0})=0$ | Low-pass, band-stop |
Derivation: Why do these restrictions exist?
Why does $H(\pi)=0$ for Type II?
For Type II, $N$ is even and $h[n]=h[N-1-n]$. Substituting into the DTFT:
$$H(e^{j\pi}) = \sum_{n=0}^{N-1}h[n](-1)^n$$
Pair the symmetric terms: $h[k]\cdot(-1)^k + h[N-1-k]\cdot(-1)^{N-1-k} = h[k]\cdot[(-1)^k + (-1)^{N-1-k}]$
When $N$ is even, $N-1$ is odd, so $(-1)^{N-1-k} = -(-1)^k$, and each pair sums to zero. $\blacksquare$
Why does $H(0)=0$ for Type III/IV?
Antisymmetry $h[n]=-h[N-1-n]$ implies that the coefficient sum $\sum h[n] = 0$ (pairs cancel).
Since $H(e^{j0}) = \sum h[n]$, it must equal zero. $\blacksquare$
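Both structural zeros are easy to confirm numerically; a sketch with arbitrary random coefficients:

```python
import numpy as np

def dtft(h, w):
    """Evaluate H(e^{jw}) = sum_n h[n] e^{-jwn} at a single frequency w."""
    n = np.arange(len(h))
    return np.sum(h * np.exp(-1j * w * n))

rng = np.random.default_rng(0)
half = rng.standard_normal(4)
h_type2 = np.concatenate([half, half[::-1]])    # symmetric, even length N=8 (Type II)
h_type4 = np.concatenate([half, -half[::-1]])   # antisymmetric, even length (Type IV)
print(abs(dtft(h_type2, np.pi)), abs(dtft(h_type4, 0.0)))   # both ≈ 0
```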
Practical choices:
- Low-pass / band-pass → use Type I
- When odd symmetry is required (e.g., Hilbert transformers, differentiators) → use Type III or IV
- Avoid: designing a high-pass with Type II forces a zero at Nyquist, making the specification impossible to meet
- Most design tools (scipy.signal.firwin) automatically select the correct type
Group Delay Analysis
The real meaning of linear phase is not "phase = 0" but "all frequency components are delayed by the same amount of time when passing through the filter" — this is quantified by the group delay.
Definition
Group delay is the negative derivative of the phase response with respect to $\omega$: $\tau_g(\omega) = -\frac{d\phi(\omega)}{d\omega}$, measured in samples. It tells you how much the envelope of a signal near frequency $\omega$ is delayed.
Why does it matter?
- Linear phase: $\tau_g$ is constant → all frequencies delayed equally → signal shape preserved
- Nonlinear phase: $\tau_g$ varies with frequency → different frequencies delayed differently → waveform distortion (even with ideal magnitude response)
- Audio and measurement applications are particularly sensitive to phase distortion — for example, ECG needs to preserve QRS waveforms
FIR vs IIR Comparison
| Type | Group Delay | Waveform Fidelity |
|---|---|---|
| Linear-phase FIR | Constant $(N-1)/2$ | Perfect (pure delay) |
| Butterworth IIR | Non-constant, peaks at transition band | Distorted |
| Chebyshev I IIR | Severely varying in transition band | Severely distorted |
| Bessel IIR | Nearly constant within passband | Good (optimized for phase) |
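The first two table rows can be reproduced with `scipy.signal.group_delay`; a sketch in which the 61-tap FIR, 6th-order Butterworth, and 0.3π cutoff are arbitrary illustrative choices:

```python
import numpy as np
from scipy import signal

fir = signal.firwin(61, 0.3)                 # linear-phase FIR, N = 61 taps
b, a = signal.butter(6, 0.3)                 # 6th-order Butterworth IIR, same cutoff

wpass = np.linspace(1e-3, 0.25 * np.pi, 200) # evaluate inside the passband only
_, gd_fir = signal.group_delay((fir, [1.0]), w=wpass)
_, gd_iir = signal.group_delay((b, a), w=wpass)

print(gd_fir.min(), gd_fir.max())            # constant (N-1)/2 = 30 samples
print(gd_iir.min(), gd_iir.max())            # varies across the band
```

Evaluating only inside the passband avoids the stopband nulls of the FIR, where the numerical group-delay computation is ill-conditioned.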
Practical applications:
- Audio processing: choose FIR or Bessel; avoid the phase distortion of Chebyshev/Elliptic
- ECG/EEG: must use linear-phase FIR (to preserve QRS/spike waveforms)
- Communications receivers: can use nonlinear-phase IIR (compensated later by an equalizer)
- Real-time forward-only: must be causal → IIR cannot achieve perfect linear phase
- Offline processing: can use filtfilt (forward-backward filtering) to achieve zero phase
6.5 2D FFT & Image Processing
Learning Objectives
- Understand the separability of 2D DFT (row FFT + column FFT)
- Implement image low-pass/high-pass filtering using frequency-domain masks
- Explain the relationship between MRI k-space and 2D FFT
Why does this matter? Because medical imaging (MRI), satellite remote sensing, and computer vision all perform denoising and feature extraction in the frequency domain — 2D FFT is a fundamental skill in image processing.
Previously: 6.4 designed one-dimensional filters. Images are two-dimensional signals — 2D FFT lets you analyze and process images in the spatial frequency domain.
Principles
Intuition: In 1D, low frequency = slow variation, high frequency = fast variation. The 2D case is entirely analogous: low spatial frequencies = slowly varying regions (smooth areas, overall brightness); high spatial frequencies = rapidly varying regions (edges, textures, noise).
2D DFT formula:
$$F[u,v] = \sum_{m=0}^{M-1}\sum_{n=0}^{N-1} f[m,n]\,e^{-j2\pi(um/M + vn/N)}$$
Separability: 2D DFT can be decomposed into row FFTs first, then column FFTs:
$$F[u,v] = \sum_{m=0}^{M-1}\left(\sum_{n=0}^{N-1} f[m,n]\,e^{-j2\pi vn/N}\right)e^{-j2\pi um/M}$$
Computational cost: 2D FFT of an M×N image = M row FFTs of length N + N column FFTs of length M = O(MN log(MN)).
Meaning of the frequency domain center:
- Center (u=0, v=0) = DC component = average image brightness
- Near center = low frequency = overall structure and smooth regions
- Far from center = high frequency = edges, details, noise
- Bright line in a specific direction = periodic structure in that direction in the image
Derivation: Separability of 2D DFT
The kernel of 2D DFT, $e^{-j2\pi(um/M + vn/N)}$, can be factored into the product of two 1D kernels:
$$e^{-j2\pi(um/M + vn/N)} = e^{-j2\pi um/M} \cdot e^{-j2\pi vn/N}$$
Therefore:
$$F[u,v] = \sum_m e^{-j2\pi um/M} \underbrace{\left(\sum_n f[m,n]\,e^{-j2\pi vn/N}\right)}_{G[m,v] = \text{1D DFT of row } m}$$
First perform 1D DFT on each row to obtain $G[m,v]$, then perform 1D DFT on each column of $G[m,v]$ to obtain $F[u,v]$. The order can be swapped (columns first, then rows).
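The separability argument is one line of NumPy: rows first, then columns, matches `fft2` exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
f = rng.standard_normal((8, 6))              # M x N "image"
G = np.fft.fft(f, axis=1)                    # 1D DFT of every row -> G[m, v]
F_sep = np.fft.fft(G, axis=0)                # then 1D DFT of every column -> F[u, v]
print(np.allclose(F_sep, np.fft.fft2(f)))    # → True
```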
How to Use
- Load grayscale image $f[m,n]$ (M×N pixels).
- (Optional) Multiply by $(-1)^{m+n}$ to center the spectrum (fftshift).
- Compute 2D FFT → $F[u,v]$.
- Design frequency-domain mask $H[u,v]$ (low-pass/high-pass/bandpass/notch).
- Multiply $G[u,v] = F[u,v] \cdot H[u,v]$.
- 2D IFFT → processed image $g[m,n]$.
Concrete Example: Removing Power Line Interference from Satellite Imagery
// Satellite image 512×512 pixels, with horizontal stripe interference caused by 60Hz power lines
// Step 1: Compute 2D FFT and observe the spectrum
F = fft2(image)
F_shifted = fftshift(F)
magnitude = log(1 + abs(F_shifted)) // log scale display
// Step 2: Find bright spots in the spectrum corresponding to interference
// Horizontal stripes → bright spots on the vertical axis (at (0, ±v₀))
// v₀ corresponds to the spatial frequency of the interference
// Step 3: Design a Notch Filter
// Place a small circular zero-gain region at each bright spot, radius r=5 pixels
H = ones(512, 512)
for each notch_point (u0, v0):
for u, v in circle(u0, v0, r=5):
H[u, v] = 0 // notch
// Step 4: Frequency-domain filtering + IFFT
G = F_shifted * H
result = real(ifft2(ifftshift(G)))
// Result: stripe interference removed, image details preserved
MRI Connection — k-space Is the 2D Frequency Domain
The MRI scanner's RF coils directly acquire data in k-space (= 2D Fourier space). Each scan trajectory fills one line of k-space. Image reconstruction = 2D IFFT.
- Scanning all of k-space → complete image, but long scan time (several minutes).
- Scanning only the center of k-space (low frequencies) → blurry image but fast (used for localizer scans).
- Scanning only the outer regions of k-space (high frequencies) → only edge information remains.
- Compressed Sensing MRI: Randomly sample part of k-space (~25%), use sparse reconstruction algorithms to recover the full image. Scan time reduced by 4x.
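The second bullet (keeping only the center of k-space yields a blurry image) is easy to simulate; in this sketch the disk "phantom" and the 16×16 retained window are arbitrary:

```python
import numpy as np

N = 64
y, x = np.mgrid[0:N, 0:N]
img = (((x - N / 2) ** 2 + (y - N / 2) ** 2) < 20 ** 2).astype(float)  # disk phantom

k = np.fft.fftshift(np.fft.fft2(img))            # full "k-space"
mask = np.zeros((N, N))
c = N // 2
mask[c - 8:c + 8, c - 8:c + 8] = 1.0             # keep only the k-space center
low = np.real(np.fft.ifft2(np.fft.ifftshift(k * mask)))

# Sharpest edge in the reconstruction vs. the original: low-pass blurs edges
print(np.abs(np.diff(img, axis=0)).max(), np.abs(np.diff(low, axis=0)).max())
```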
Application Scenarios
- MRI image reconstruction: 256×256 k-space data → 2D IFFT → anatomical image. Compressed sensing can achieve 4-8x acceleration.
- Astronomical image processing: Telescope images are blurred by atmospheric turbulence (PSF blur); Wiener deconvolution in the frequency domain recovers detail.
- Industrial X-ray inspection: Periodic structures in PCB X-ray images (via arrays) produce bright spots in the frequency domain; notch filtering can highlight defects.
- Boundary effects: 2D FFT assumes the image is periodically extended. If the image boundaries are discontinuous, a cross-shaped spectral leakage occurs. Solutions: mirror padding or edge tapering.
- Ringing: Sharp frequency-domain masks (e.g., ideal low-pass) cause ringing in the spatial domain. Use Gaussian or Butterworth-type masks to mitigate.
- Phase matters: The structural information of an image is primarily in the phase (not the magnitude). Do not destroy phase during processing.
- Structures in the image are non-periodic → frequency-domain methods perform poorly; use spatial-domain methods (e.g., bilateral filter, non-local means).
- Local processing is needed (different filtering for different regions) → 2D STFT or wavelet is needed.
- Alternative: Wavelet transforms are more powerful than 2D FFT for multi-scale analysis, and are mainstream for modern image compression (JPEG 2000) and denoising.
Interactive: 2D FFT Image Filtering
Select a test pattern, observe its 2D spectrum, and apply different frequency-domain filters.
Original Image
2D FFT Spectrum
Filtered Image
6.6 OFDM (Orthogonal Frequency Division Multiplexing)
Learning Objectives
- Describe the complete OFDM symbol transmit/receive flow (IFFT → +CP → channel → -CP → FFT)
- Derive how CP converts linear convolution into circular convolution
- Perform frequency-domain channel estimation using LS/MMSE
Why does this matter? Because your phone, Wi-Fi, and digital TV all use OFDM — it is the absolute core of modern wireless communications.
Previously: 6.5 demonstrated FFT applications in imaging. Now we look at communications — OFDM uses IFFT/FFT to drive every connection on your phone.
- Robert Chang, 1966: Proposed the basic concept of OFDM at Bell Labs.
- Weinstein & Ebert, 1971: First implemented OFDM modulation/demodulation using DFT/IDFT, making OFDM practically feasible.
- Peled & Ruiz, 1980: Introduced Cyclic Prefix (CP) to solve the ISI problem.
- Widespread adoption from the 1990s: DVB-T (digital TV), 802.11a (Wi-Fi), ADSL.
- 4G LTE (2009), 5G NR (2018), Wi-Fi 6/7 are all OFDM-based.
Principles
Intuition: Instead of using a single high-speed carrier to transmit large amounts of data (each symbol is very short → easily corrupted by ISI), spread the data across N low-speed subcarriers for parallel transmission. Each subcarrier's symbol period = N times the original → much longer than the channel delay spread → ISI becomes negligible.
Transmitter:
- N QAM (Quadrature Amplitude Modulation) symbols $D[0], D[1], \ldots, D[N-1]$
- N-point IFFT: $d[n] = \text{IFFT}\{D[k]\}$ → time-domain OFDM symbol
- Add CP (Cyclic Prefix): copy the last $N_{CP}$ points of $d[n]$ to the front
- Pass through DAC and RF front-end for transmission
Receiver:
- Remove CP
- N-point FFT: $Y[k] = \text{FFT}\{y[n]\}$
- Equalize each subcarrier independently: $\hat{D}[k] = Y[k] / \hat{H}[k]$ (only one complex division!)
Why does CP work?
CP converts linear convolution into circular convolution — which is exactly what FFT multiplication assumes. As long as the CP length ≥ channel delay spread, after FFT at the receiver each subcarrier sees $Y[k] = H[k] \cdot D[k] + W[k]$, a perfect frequency-domain multiplication relationship.
Full Derivation: How CP Eliminates ISI
Channel impulse response $h[n]$, length L (delay spread = L-1 samples).
Transmitted time-domain signal $d[n]$ of length N, with CP the total length is $N + N_{CP}$.
Received signal: $r[n] = \sum_{l=0}^{L-1} h[l]\,d_{\text{CP}}[n-l] + w[n]$
After removing the CP (taking n = N_CP to N_CP+N-1), if $N_{CP} \geq L-1$:
- The "tail" of the previous OFDM symbol falls entirely within the CP interval → discarded → no ISI.
- The convolution of the current symbol "ramps up" within the CP interval; after CP removal, it is equivalent to circular convolution of $d[n]$.
Circular convolution in the frequency domain = element-wise multiplication:
$$Y[k] = H[k] \cdot D[k] + W[k]$$
Therefore each subcarrier can be equalized independently: $\hat{D}[k] = Y[k] / H[k]$. In contrast, single-carrier systems require matrix inversion (O(N³) or equalizer approximation).
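The derivation can be checked numerically for a single OFDM symbol; in this sketch N = 64, CP = 16, and the 9-tap channel are illustrative values (the channel length satisfies L − 1 ≤ N_CP):

```python
import numpy as np

rng = np.random.default_rng(2)
N, Ncp, L = 64, 16, 9                            # subcarriers, CP length, channel taps
h = rng.standard_normal(L) * np.exp(-np.arange(L))   # toy decaying multipath channel
qpsk = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j])
D = rng.choice(qpsk, N)                          # frequency-domain QPSK symbols

d = np.fft.ifft(D)                               # IFFT -> time-domain symbol
tx = np.concatenate([d[-Ncp:], d])               # prepend cyclic prefix
rx = np.convolve(tx, h)                          # LINEAR convolution with the channel

Y = np.fft.fft(rx[Ncp:Ncp + N])                  # remove CP, FFT
H = np.fft.fft(h, N)                             # channel frequency response
print(np.allclose(Y, H * D))                     # → True: Y[k] = H[k] D[k] exactly
```

Even though the physical channel applies linear convolution, the CP makes the retained block identical to a circular convolution, so one complex division per subcarrier recovers D.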
How to Use
5G NR Specific Parameters:
| Numerology (μ) | Subcarrier Spacing (SCS) | FFT Size (max) | CP Length | Symbol Duration | Use Case |
|---|---|---|---|---|---|
| 0 | 15 kHz | 4096 | 288 / 352 samples | 71.4 μs | <3GHz wide-area coverage |
| 1 | 30 kHz | 4096 | 288 / 352 samples | 35.7 μs | 3.5GHz mainstream deployment |
| 2 | 60 kHz | 4096 | 288 / 352 samples | 17.8 μs | mmWave |
| 3 | 120 kHz | 4096 | 288 / 352 samples | 8.9 μs | mmWave high-speed |
Concrete Example: 5G NR μ=1 (30kHz SCS)
// FFT size: 4096
// Sample rate: 4096 × 30kHz = 122.88 MHz
// Active subcarriers: 3276 (100MHz bandwidth)
// OFDM symbol duration: 1/30kHz + CP = 33.33μs + 2.34μs = 35.67μs
// Per slot (14 symbols): 0.5 ms
// OFDM symbols per second: 14 × 2000 = 28,000
// FFTs per second (TX+RX): 28,000 × 2 = 56,000 4096-point FFTs
// Complex multiplications per second: 56,000 × 4096 × log₂(4096)/2 = 56,000 × 24,576 ≈ 1.38 × 10⁹
Channel Estimation
- Pilots (Reference Signals): Insert known symbols at known subcarrier positions.
- LS Estimation (Least Squares): $\hat{H}[k_p] = Y[k_p] / D_{\text{pilot}}[k_p]$ (only at pilot positions)
- Interpolation: Interpolate from LS-estimated pilot points to obtain $\hat{H}[k]$ for all subcarriers.
- MMSE Estimation: Exploits the statistical properties of the channel delay profile for optimal interpolation, outperforming LS by 3-5dB.
- Frequency-domain equalization: Zero-Forcing: $\hat{D}[k] = Y[k]/\hat{H}[k]$. MMSE: $\hat{D}[k] = \frac{\hat{H}^*[k]}{|\hat{H}[k]|^2 + \sigma_w^2/\sigma_D^2} Y[k]$
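The LS-plus-interpolation flow above can be sketched on a toy channel; the pilot spacing of 4, the 4-tap channel, and the noise level are all assumptions for the demo:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 64
h = np.array([1.0, 0.5, 0.25, 0.1])              # toy 4-tap channel (smooth in frequency)
H_true = np.fft.fft(h, N)
qpsk = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j])
D = rng.choice(qpsk, N)
noise = 0.01 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
Y = H_true * D + noise

pilots = np.append(np.arange(0, N, 4), N - 1)    # every 4th subcarrier + band edge
H_ls = Y[pilots] / D[pilots]                     # LS estimate at the pilot positions
k = np.arange(N)
H_hat = (np.interp(k, pilots, H_ls.real)         # linear interpolation (real/imag parts)
         + 1j * np.interp(k, pilots, H_ls.imag))

nmse = np.mean(np.abs(H_hat - H_true) ** 2) / np.mean(np.abs(H_true) ** 2)
print(nmse)                                      # small: interpolation tracks the channel
```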
Application Scenarios
- 5G NR: Both downlink and uplink use CP-OFDM (uplink can also use DFT-s-OFDM to reduce PAPR). Over 2 million base stations deployed globally (2025).
- Wi-Fi 6/7 (802.11ax/be): OFDMA (multi-user OFDM), 2048-point FFT, supporting 160/320MHz bandwidth.
- DVB-T2 digital television: 32K FFT (32768 points), handling long delay spreads (mountain reflections).
- Carrier Frequency Offset (CFO): Transmitter and receiver oscillator frequencies are not perfectly matched → subcarriers are no longer orthogonal → Inter-Carrier Interference (ICI). For example, 5G @28GHz with 0.1ppm oscillator accuracy = 2.8kHz offset, which is 9.3% of 30kHz SCS — frequency synchronization is mandatory.
- PAPR (Peak-to-Average Power Ratio): When the phases of N subcarriers align, peak power can reach N times the average power. PAPR ~10-12dB → power amplifiers need a large linear range (expensive and power-hungry). PAPR reduction methods: clipping, DFT-s-OFDM, tone reservation.
- CP is overhead: CP carries no new information and occupies ~7% of time and spectral resources.
- The channel has almost no multipath (e.g., satellite communications) → single carrier is sufficient, and there is no PAPR problem.
- Ultra-low latency requirements (CP adds latency) → FBMC (Filter Bank Multi-Carrier) does not need CP.
- Alternative: SC-FDMA (Single-Carrier FDMA) is used for LTE uplink with 2-3dB lower PAPR than OFDM. FBMC and UFMC are candidate technologies for beyond-5G.
✅ Quick Check
Q1: What happens if the OFDM cyclic prefix (CP) is too short?
Show answer
Channel delay exceeds CP → adjacent OFDM symbols overlap → ISI. Also, the effective convolution within the FFT interval is not circular → subcarrier orthogonality is broken → ICI.
Q2: 5G NR subcarrier spacing 30kHz, 4096 FFT — what is the approximate bandwidth?
Show answer
30kHz × 4096 ≈ 122.88 MHz.
Interactive: OFDM Symbol Generation (IFFT)
64 subcarriers, each carrying a QPSK symbol. IFFT converts frequency-domain data into a time-domain OFDM symbol.
Interactive: Complete OFDM Transceiver Simulation
Full simulation of the OFDM transceiver chain: IFFT generates symbol → add CP → multipath channel → add noise → remove CP → FFT → LS channel estimation → equalization → demodulation.
6.7 Radar Signal Processing
Learning Objectives
- Implement pulse compression using matched filtering
- Explain the dual-FFT architecture of Range-Doppler processing
- Determine a waveform's range-velocity resolution capability from its ambiguity function
Why does this matter? Because autonomous driving, weather forecasting, and military detection all rely on radar, and the core of radar signal processing is FFT.
Previously: 6.6 covered FFT in communications. Radar is also a heavy user of FFT — range uses fast-time FFT, velocity uses slow-time FFT.
- Pulse Compression: P.M. Woodward, 1953. Used matched filtering to improve range resolution without increasing peak power.
- Pulse-Doppler Processing: 1960s, led by the US Air Force, using FFT to extract Doppler frequencies from multiple pulse echoes.
- Synthetic Aperture Radar (SAR): Developed from the 1950s, synthesizing a large antenna aperture from the flight path to achieve high-resolution imagery.
Principles
Intuition: Transmit a known waveform $s(t)$; the target-reflected signal is a delayed + frequency-shifted version. To find the delay (= range) and frequency shift (= velocity), the best approach is matched filtering — correlating the known waveform with the echo. FFT enables this correlation to be computed efficiently.
Matched Filter
The optimal filter that maximizes SNR = the time-reversed conjugate of the transmitted signal:
$$h_{\text{MF}}(t) = s^*(-t)$$
Frequency-domain implementation: $Y(\omega) = S^*(\omega) \cdot X(\omega)$ → IFFT → compressed pulse
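A sketch of matched filtering on an LFM chirp; the sample rate, chirp parameters, and 100-sample target delay are illustrative:

```python
import numpy as np

fs, T, B = 10e6, 20e-6, 2e6                      # sample rate, pulse length, bandwidth
t = np.arange(int(fs * T)) / fs                  # 200 samples over the pulse
s = np.exp(1j * np.pi * (B / T) * t ** 2)        # LFM chirp, chirp rate mu = B/T

delay = 100                                      # echo delayed by 100 samples
echo = np.concatenate([np.zeros(delay, dtype=complex), s])

h_mf = np.conj(s[::-1])                          # matched filter: time-reversed conjugate
y = np.convolve(echo, h_mf)                      # pulse compression
peak = int(np.argmax(np.abs(y)))
print(peak - (len(s) - 1))                       # → 100: peak lands at the target delay
```

The 20 μs pulse compresses to a peak a few samples wide (mainlobe ~fs/B), illustrating long-pulse energy plus wide-bandwidth resolution.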
Range-Doppler Processing
Received data is arranged in a 2D matrix: [fast-time × slow-time]
- Fast-time FFT (each row): Range compression — compresses chirp pulses into narrow peaks.
- Slow-time FFT (each column): Doppler processing — extracts velocity from phase changes across multiple pulses.
- Result: Range-Doppler Map — each bright spot = a target, horizontal position = range, vertical position = velocity.
Resolution:
$$\Delta R = \frac{c}{2B} \quad (\text{range resolution, B = bandwidth})$$
$$\Delta v = \frac{\lambda}{2T_{\text{CPI}}} \quad (\text{velocity resolution, }T_{\text{CPI}} = \text{coherent processing interval})$$
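The dual-FFT flow above can be sketched with one synthetic point target (the bin positions 20 and 10 are arbitrary):

```python
import numpy as np

Ns, Nc = 256, 128                                # fast-time samples, chirps per CPI
n = np.arange(Ns)                                # fast time
m = np.arange(Nc)[:, None]                       # slow time (one row per chirp)
f_range, f_dopp = 20 / Ns, 10 / Nc               # target at range bin 20, Doppler bin 10
data = np.exp(2j * np.pi * (f_range * n + f_dopp * m))   # [slow-time x fast-time]

rd = np.fft.fft(np.fft.fft(data, axis=1), axis=0)        # fast-time FFT, then slow-time FFT
peak = tuple(int(i) for i in np.unravel_index(np.argmax(np.abs(rd)), rd.shape))
print(peak)                                      # → (10, 20): (Doppler bin, range bin)
```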
Ambiguity Function
Definition:
$$\chi(\tau, f_d) = \int_{-\infty}^{\infty} s(t)\,s^*(t-\tau)\,e^{j2\pi f_d t}\,dt$$
Intuitive interpretation: The ambiguity function describes a waveform's resolution capability on the "range ($\tau$) — Doppler ($f_d$)" plane. $|\chi(0,0)|$ = peak (perfect match); $|\chi(\tau, f_d)|$ at other locations = sidelobes (degree of ambiguity).
Woodward's Theorem: The total volume of the ambiguity function is conserved — suppressing sidelobes in one area raises them elsewhere. Waveform design is about "sculpting" sidelobe shapes on the range-Doppler plane.
- LFM Chirp: Oblique ridge-shaped ambiguity function (range and Doppler are coupled, but the main lobe is narrow).
- Phase-Coded Pulse: Thumbtack shape (low sidelobes but Doppler-sensitive).
Derivation: Range Resolution of LFM Chirp
LFM (Linear Frequency Modulated) chirp: $s(t) = \text{rect}(t/T)\,e^{j\pi \mu t^2}$, where $\mu = B/T$ (chirp rate).
Matched filter output (at zero Doppler):
$$\chi(\tau, 0) = \int s(t)\,s^*(t-\tau)\,dt$$
After derivation (expanding and simplifying the quadratic phase terms), the result is approximately:
$$|\chi(\tau, 0)| \approx T\,\text{sinc}(B\tau)$$
The first null of the sinc function is at $\tau = 1/B$, corresponding to range:
$$\Delta R = \frac{c\tau}{2} = \frac{c}{2B}$$
Key observation: range resolution depends only on bandwidth B, independent of pulse duration T. This is the power of pulse compression — long pulse (high energy) + wide bandwidth (high resolution) can be achieved simultaneously.
How to Use
Concrete Example: Automotive 77GHz FMCW Radar
// System parameters
f_c = 77 GHz           // carrier frequency
B = 1 GHz              // chirp bandwidth
T_chirp = 50 μs        // single chirp duration
N_chirps = 128         // number of chirps per CPI
λ = c / f_c = 3.896 mm // wavelength
// Range resolution
ΔR = c / (2B) = 3×10⁸ / (2×10⁹) = 0.15 m = 15 cm
// Maximum unambiguous range
R_max = c × T_chirp / 2 = 3×10⁸ × 50×10⁻⁶ / 2 = 7500 m
// (in practice limited by ADC sample rate and SNR, typically ~200m)
// Velocity resolution
Δv = λ / (2 × N_chirps × T_chirp) = 3.896×10⁻³ / (2 × 128 × 50×10⁻⁶) = 0.304 m/s ≈ 1.1 km/h
// Maximum unambiguous velocity
v_max = λ / (4 × T_chirp) = 3.896×10⁻³ / (4 × 50×10⁻⁶) = 19.5 m/s ≈ 70 km/h
// FFT processing
// Fast-time: 256-point FFT → 256 range bins, each bin = R_max/256 ≈ 29 m
// (actual ADC samples, e.g., 256 @10MHz → covers 3840m)
// Slow-time: 128-point FFT → 128 velocity bins
// Range-Doppler Map: 256 × 128 matrix, each bright spot = one target
Application Scenarios
- Automotive FMCW radar: 77GHz, B=1-4GHz, range resolution 3.75-15cm. 3-5 radars per vehicle, global annual production exceeding 500 million units (2025).
- Weather radar: S-band (2.7-3.0GHz), using Doppler FFT to measure radial velocity of precipitation particles (wind field). WSR-88D Doppler radar uses 1024-point FFT, velocity resolution ~0.5 m/s.
- SAR satellite imagery: Azimuth focusing uses FFT, achieving ~1m spatial resolution. Sentinel-1 satellite processes TB-scale data per orbit.
- LFM range-Doppler coupling: A moving target's Doppler shift is misinterpreted as a range offset. Doppler compensation or dual-slope chirp is needed.
- Sidelobe masking: Sidelobes of strong targets can obscure weak targets. Windowing to reduce sidelobes + CFAR (Constant False Alarm Rate) adaptive threshold detection is needed.
- Blind speed: When a target's velocity is exactly an integer multiple of $v_{\text{max}}$, the Doppler shift wraps around to zero → undetectable. Solved by staggered PRF (Pulse Repetition Frequency).
- Very few and known targets → parametric estimation methods (e.g., MUSIC, ESPRIT) offer higher resolution than FFT.
- Nonlinear frequency modulation (NLFM) waveforms → matched filtering still applies, but standard FFT-based processing needs modification.
- Alternative: Compressed Sensing radar can reconstruct the Range-Doppler map from fewer samples (suitable for sparse scenes). MIMO radar uses multiple transmit/receive antennas to increase virtual aperture.
📝 Worked Example
77GHz FMCW automotive radar: bandwidth B=1GHz, chirp duration 50μs, 128 chirps. (a) Range resolution? (b) Maximum unambiguous range (256-point FFT)? (c) Velocity resolution?
Show solution
(a) ΔR = c/(2B) = 3×10⁸/(2×10⁹) = 0.15m = 15cm
(b) Rmax = N·ΔR = 256×0.15 = 38.4m
(c) λ = c/f = 3.9mm, Δv = λ/(2·128·50μs) = 3.9×10⁻³/(2×128×50×10⁻⁶) = 0.30 m/s
✅ Quick Check
Q1: For a radar range resolution of 15cm, what bandwidth is needed?
Show answer
ΔR = c/(2B) → B = c/(2·0.15) = 3×10⁸/(0.3) = 1 GHz.
Q2: Why is the LFM chirp's ambiguity function ridge-shaped?
Show answer
Because LFM frequency increases linearly with time, creating a coupling between range and Doppler — the range estimate of a stationary target shifts due to Doppler offset.
Interactive: Ambiguity Function
The ambiguity function describes a radar waveform's resolution capability in two dimensions: range (delay τ) and velocity (Doppler fd). Its shape determines the waveform's performance.
Interactive: Matched Filter & Pulse Compression
Radar transmits a long chirp pulse, which the matched filter at the receiver compresses into a sharp peak. Long pulse = high energy (long-range detection); after compression = high range resolution (resolving close targets).
Interactive: Radar Target Placement & Range-Doppler
Set the range, velocity, and amplitude of 3 targets and observe the response on the Range-Doppler Map.
6.8 Array Signal Processing
Learning Objectives
- Write the ULA steering vector and explain the spatial Nyquist condition
- Compare three beamforming methods: Delay-and-Sum, Capon, and MUSIC
- Compute the angular resolution of an array from the beamwidth formula
Why does this matter? Because 5G massive MIMO, phased array radar, and sonar positioning all rely on antenna arrays — spatial filtering is FFT in the spatial dimension.
Previously: 6.7 used FFT for range-velocity estimation in radar. Antenna arrays extend the same concept to space — using "spatial FFT" to estimate signal direction of arrival.
- Phased Array: Already in use during WWII radar in the 1940s.
- Capon (MVDR) Beamforming: Jack Capon, 1969. Minimum Variance Distortionless Response.
- MUSIC Algorithm: Ralph Schmidt, 1986. Exploits the orthogonality between signal subspace and noise subspace for high-resolution DOA (Direction of Arrival) estimation.
- Massive MIMO: Thomas Marzetta, 2010. Proposed using large numbers of antennas (64-256) to simultaneously serve multiple users. Became a core 5G technology.
Principles
Intuition: A set of equally spaced antennas (Uniform Linear Array, ULA) receives the same plane wave, but each antenna has a slightly different reception time (depending on the wave's angle of incidence). This time difference = phase difference. Analyzing these phase differences reveals the signal's direction of arrival. This is perfectly analogous to time-domain sampling: antenna spacing = spatial sampling interval, angle of incidence = spatial frequency.
ULA Model
M antennas equally spaced by d. A plane wave arrives from angle $\theta$, and the phase difference between adjacent antennas is:
$$\Delta\phi = \frac{2\pi d \sin\theta}{\lambda}$$
Steering Vector:
$$\mathbf{a}(\theta) = \begin{bmatrix} 1 \\ e^{j\frac{2\pi d\sin\theta}{\lambda}} \\ e^{j\frac{2\pi \cdot 2d\sin\theta}{\lambda}} \\ \vdots \\ e^{j\frac{2\pi(M-1)d\sin\theta}{\lambda}} \end{bmatrix}$$
Beamforming = Spatial Filtering
$$y = \mathbf{w}^H \mathbf{x}$$
where $\mathbf{x}$ is the received vector from M antennas and $\mathbf{w}$ is the weight vector.
Conventional Beamforming (Delay-and-Sum, DAS): $\mathbf{w} = \mathbf{a}(\theta_0)/M$. This is essentially a spatial DFT — scanning all $\theta$ is equivalent to performing DFT on spatial samples.
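The DAS scan as a spatial DFT can be sketched directly; a 16-element half-wavelength ULA and a single noiseless source at 20° are assumed:

```python
import numpy as np

M, d_lam = 16, 0.5                               # elements, spacing in wavelengths
theta0 = np.deg2rad(20.0)                        # true direction of arrival
m = np.arange(M)
x = np.exp(2j * np.pi * d_lam * m * np.sin(theta0))   # one noiseless snapshot

thetas = np.deg2rad(np.linspace(-90, 90, 721))   # 0.25-degree scan grid
A = np.exp(2j * np.pi * d_lam * np.outer(m, np.sin(thetas)))  # steering matrix
P = np.abs(A.conj().T @ x) ** 2 / M ** 2         # delay-and-sum spatial spectrum

print(round(float(np.rad2deg(thetas[np.argmax(P)])), 2))      # → 20.0
```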
Spatial Nyquist Theorem
d ≤ λ/2, otherwise grating lobes appear (spatial aliasing! Perfectly analogous to aliasing in the time-domain Nyquist theorem).
When d > λ/2, signals from different directions produce identical phase differences on the antenna array → indistinguishable → spatial aliasing.
Derivation: Spatial DFT and Angular Resolution
Spatial power spectrum of the DAS beamformer:
$$P_{\text{DAS}}(\theta) = \frac{1}{M^2}\left|\sum_{m=0}^{M-1} e^{j\frac{2\pi md}{\lambda}(\sin\theta - \sin\theta_0)}\right|^2$$
Let the spatial frequency be $u = d\sin\theta/\lambda$ — this is a discrete Fourier sum!
Main beam width (3dB beamwidth):
$$\Delta\theta_{3\text{dB}} \approx \frac{0.886\lambda}{Md\cos\theta_0}$$
At broadside ($\theta_0 = 0$), this simplifies to $\Delta\theta \approx 0.886\lambda/(Md)$.
More antennas (larger M) and wider spacing (larger d) → narrower beam → higher angular resolution. But d > λ/2 produces grating lobes.
How to Use
Concrete Example: 5G Massive MIMO Base Station
// Parameters
M = 64 antennas (8×8 planar array)
f_c = 28 GHz (mmWave)
λ = c / f_c = 3×10⁸ / 28×10⁹ = 10.71 mm
d = λ/2 = 5.36 mm (antenna spacing)
// Beamwidth
Δθ ≈ 0.886 × λ / (M_row × d) = 0.886 × 10.71 / (8 × 5.36) = 0.221 rad ≈ 12.7°
// (8×8 array has 8 antennas in horizontal and vertical)
// Using all 64 antennas for 2D beamforming:
// Effective aperture = 8 × 5.36mm = 42.9mm
// Beamwidth ≈ 12.7° × 12.7° (both dimensions)
// Spatial multiplexing capability
// 64 antennas can form multiple independent beams simultaneously
// Max ~M/2 = 32 users (theoretical upper limit)
// In practice 8-16 parallel users (limited by channel correlation)
// Spectral efficiency improvement
// Single user: ~5 bps/Hz
// 16-user MU-MIMO: ~80 bps/Hz (ideal case)
Advanced Methods
Capon (MVDR) Beamformer:
$$\mathbf{w}_{\text{Capon}} = \frac{\mathbf{R}^{-1}\mathbf{a}(\theta_0)}{\mathbf{a}^H(\theta_0)\mathbf{R}^{-1}\mathbf{a}(\theta_0)}$$
Minimizes output power subject to unity gain in the desired direction. Result: a much narrower beam than DAS, effectively suppressing interference.
MUSIC DOA Estimation:
Completely analogous to the frequency-domain MUSIC (see Section 3.4), just replacing "frequency" with "angle":
$$P_{\text{MUSIC}}(\theta) = \frac{1}{\mathbf{a}^H(\theta)\,\mathbf{E}_n\mathbf{E}_n^H\,\mathbf{a}(\theta)}$$
where $\mathbf{E}_n$ is the noise subspace. Peak positions = signal directions of arrival. Can resolve sources separated by much less than the beamwidth.
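A sketch of MUSIC DOA estimation with two sources; the angles (−10°, 15°), 8 elements, 200 snapshots, and noise level are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
M, K, T = 8, 2, 200                              # elements, sources, snapshots
doas = np.deg2rad(np.array([-10.0, 15.0]))
m = np.arange(M)[:, None]
A = np.exp(2j * np.pi * 0.5 * m * np.sin(doas))  # M x K steering matrix (d = lambda/2)

S = rng.standard_normal((K, T)) + 1j * rng.standard_normal((K, T))
noise = 0.1 * (rng.standard_normal((M, T)) + 1j * rng.standard_normal((M, T)))
X = A @ S + noise
R = X @ X.conj().T / T                           # sample covariance

eigval, eigvec = np.linalg.eigh(R)               # eigenvalues in ascending order
En = eigvec[:, :M - K]                           # noise subspace (smallest eigenvalues)

grid = np.deg2rad(np.linspace(-90, 90, 1801))
Ag = np.exp(2j * np.pi * 0.5 * m * np.sin(grid))
P = 1.0 / np.sum(np.abs(En.conj().T @ Ag) ** 2, axis=0)   # MUSIC pseudospectrum

# pick the two largest local maxima of the pseudospectrum
pk = np.where((P[1:-1] > P[:-2]) & (P[1:-1] > P[2:]))[0] + 1
est = np.sort(np.rad2deg(grid[pk[np.argsort(P[pk])[-2:]]]))
print(est.round(1))                              # two peaks near -10 and 15 degrees
```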
Application Scenarios
- 5G Massive MIMO: 64-256 antennas, mmWave beam tracking, updating beam direction every millisecond.
- Radar electronic scanning (AESA): Fighter jet radars use thousands of antenna elements for electronic scanning, switching beam direction in milliseconds (mechanical rotation takes seconds).
- Hearing aid beamforming: 2-4 microphone arrays for speech enhancement. In noisy environments, the beam is steered toward the speaker (SNR improvement of 5-10dB).
Pitfalls and Limitations
- Calibration errors: Inconsistencies in antenna position, gain, and phase severely degrade performance. Massive MIMO requires periodic Over-the-Air Calibration.
- Capon/MUSIC requires sufficient snapshots: Estimating the covariance matrix $\mathbf{R}$ requires ≥2M snapshots for stability. Rapidly changing environments may not allow enough accumulation.
- Wideband signals: The steering vector $\mathbf{a}(\theta)$ depends on frequency. Wideband signals require space-time processing (wideband beamforming), such as DFT beamforming + sub-band processing.
- Near-field effects: When target distance < $2D^2/\lambda$ (D = array aperture), the plane wave assumption breaks down, and near-field beamforming is needed.
- If there is only one signal source and direction finding is not needed → a single antenna suffices.
- Signal sources are confined to a narrow angular range (e.g., satellite communications) → a fixed-pointing antenna is cheaper than an array.
- Alternative: If you need to separate co-frequency signals but do not care about direction, use CDMA (Code Division Multiple Access) or NOMA (Non-Orthogonal Multiple Access).
✅ Quick Check
Q1: What happens when ULA antenna spacing d > λ/2?
Show answer
Grating lobes appear: spatial aliasing, analogous to the aliasing caused by an insufficient sampling rate in the time domain.
Q2: 5G massive MIMO with 64 antennas @28GHz (λ≈10.7mm) — approximately what is the beamwidth?
Show answer
Treating the 64 antennas as a single ULA with d = λ/2 = 5.35 mm: Δθ ≈ λ/(M·d) ≈ 10.7/(64×5.35) ≈ 0.031 rad ≈ 1.8°. (The earlier 8×8 planar example has only 8 elements per dimension, hence ≈12.7° per dimension.)
6.9 Biomedical Signal Analysis
Learning Objectives
- List the five major EEG frequency bands and explain their physiological significance
- Estimate EEG band power ratios using the Welch method
- Describe the standard workflow for HRV frequency-domain analysis (R-R intervals → resampling → PSD → LF/HF)
Why does this matter? Because clinical interpretation of EEG/ECG increasingly relies on frequency-domain quantitative metrics — this is an essential skill for entering biomedical engineering.
Previously: 6.8 dealt with man-made signals. Biomedical signals (brainwaves, ECG) are also important application scenarios for frequency-domain analysis.
- Hans Berger, 1929: First recorded human EEG (Electroencephalography), discovering the 8-13Hz α wave (Alpha rhythm) that appears when eyes are closed.
- HRV frequency-domain analysis standardization: The European Society of Cardiology (ESC) and the North American Society of Pacing and Electrophysiology (NASPE) published the Task Force report in 1996, defining the LF/HF frequency bands and analysis methods for HRV.
Principles
Five Major EEG Frequency Bands
| Band | Frequency Range | Physiological Significance | Clinical Application |
|---|---|---|---|
| δ (Delta) | 0.5 – 4 Hz | Deep sleep (N3 stage) | Sleep depth assessment, brain injury detection |
| θ (Theta) | 4 – 8 Hz | Light sleep, meditation, memory encoding | Sleep staging (N1/N2), attention index |
| α (Alpha) | 8 – 13 Hz | Relaxed wakefulness, eyes closed | Alertness assessment, BCI control |
| β (Beta) | 13 – 30 Hz | Focus, alertness, active thinking | Attention monitoring, anxiety assessment |
| γ (Gamma) | 30 – 100 Hz | Higher cognition, cross-regional integration | Epileptic high-frequency oscillation detection |
Analysis methods:
- Welch PSD (see Section 3.2) → integrate power in each band → power ratios
- For example, α/θ ratio = attention index (higher = more focused)
- Relative power: $P_\alpha^{\text{rel}} = P_\alpha / P_{\text{total}}$
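The Welch-then-integrate recipe can be illustrated with a short runnable example. The synthetic "EEG" below is an assumption for demonstration: a strong 10 Hz alpha line plus weaker theta and broadband noise, not real data:

```python
import numpy as np
from scipy.signal import welch

fs = 256
t = np.arange(0, 30, 1 / fs)                    # 30 s of synthetic "eyes-closed" EEG
rng = np.random.default_rng(0)
eeg = (40 * np.sin(2 * np.pi * 10 * t)          # dominant 10 Hz alpha
       + 10 * np.sin(2 * np.pi * 6 * t)         # weaker theta
       + 5 * rng.standard_normal(t.size))       # broadband noise

f, psd = welch(eeg, fs=fs, nperseg=2 * fs)      # 2 s Hann segments, 50% overlap

def band_power(f, psd, lo, hi):
    m = (f >= lo) & (f < hi)
    return np.sum(psd[m]) * (f[1] - f[0])       # rectangle-rule band integration

P_theta = band_power(f, psd, 4, 8)
P_alpha = band_power(f, psd, 8, 13)
attention_index = P_alpha / P_theta             # well above 1 here: alpha dominates
```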
HRV Frequency-Domain Analysis
Heart Rate Variability (HRV) reflects the activity of the Autonomic Nervous System.
| Band | Frequency Range | Primary Regulation |
|---|---|---|
| VLF (Very Low Frequency) | 0.003 – 0.04 Hz | Thermoregulation, renin-angiotensin system |
| LF (Low Frequency) | 0.04 – 0.15 Hz | Sympathetic + parasympathetic (not purely sympathetic!) |
| HF (High Frequency) | 0.15 – 0.4 Hz | Parasympathetic (respiratory sinus arrhythmia) |
Analysis workflow: R-R interval series (Tachogram) → resample (non-uniform → uniform, typically cubic spline @4Hz) → PSD → band power integration.
Derivation: Why Does HRV Require Resampling?
The R-R interval series is inherently non-uniformly sampled (because heartbeat intervals are not fixed). Fourier analysis requires uniform sampling.
Solution:
- Place each R-R interval value at its corresponding time point (non-uniform sequence).
- Use cubic spline interpolation to generate a uniform sequence (typically 4Hz = one point every 0.25 seconds).
- The Nyquist frequency at 4Hz sample rate = 2Hz, sufficient to cover the HF band (0.4Hz).
Alternative: The Lomb-Scargle periodogram can directly handle non-uniform data, but Welch PSD is more widely used in clinical practice.
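The resample-then-Welch workflow above can be sketched in SciPy. The synthetic tachogram, with a respiratory (HF) modulation near 0.25 Hz, is an assumption for illustration:

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import welch

# Synthetic tachogram: mean RR = 0.8 s, modulated at ~0.25 Hz (respiratory sinus arrhythmia)
beats = np.arange(300)
rr = 0.8 + 0.05 * np.sin(2 * np.pi * 0.25 * beats * 0.8)   # RR intervals (s)
t_rr = np.cumsum(rr)                                       # R-peak times: non-uniform grid

fs = 4.0                                                   # resample at 4 Hz
t_uni = np.arange(t_rr[0], t_rr[-1], 1 / fs)
rr_uni = CubicSpline(t_rr, rr)(t_uni)                      # cubic-spline interpolation
rr_uni -= rr_uni.mean()                                    # remove DC before the PSD

f, psd = welch(rr_uni, fs=fs, nperseg=256)                 # 64 s Hann segments
df = f[1] - f[0]
LF = np.sum(psd[(f >= 0.04) & (f < 0.15)]) * df
HF = np.sum(psd[(f >= 0.15) & (f < 0.40)]) * df
```

The PSD peak lands in the HF band at the modulation frequency, so HF power dominates LF here, as expected for a purely respiratory rhythm.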
How to Use
EEG Analysis Steps
// Step 1: Signal acquisition
// Sample rate: 250-1000 Hz (clinical standard: 256 or 512 Hz)
// Electrodes: International 10-20 system (19-64 channels)
// ADC resolution: 24-bit (EEG amplitude is only ~10-100 μV)

// Step 2: Preprocessing
band_pass_filter(0.5, 100)   // Remove DC drift and high-frequency noise
notch_filter(60)             // Remove power line interference (60 Hz)
ICA_artifact_removal()       // Independent component analysis to remove eye/muscle artifacts

// Step 3: Spectral analysis (Welch method)
segment_length = 2 * fs      // 2-second segments (512 points @256Hz)
overlap = 0.5                // 50% overlap
window = 'hann'
PSD = welch(eeg_channel, segment_length, overlap, window)

// Step 4: Band power computation
P_delta = integrate(PSD, 0.5, 4)    // δ power
P_theta = integrate(PSD, 4, 8)      // θ power
P_alpha = integrate(PSD, 8, 13)     // α power
P_beta  = integrate(PSD, 13, 30)    // β power
P_gamma = integrate(PSD, 30, 100)   // γ power
P_total = P_delta + P_theta + P_alpha + P_beta + P_gamma

// Step 5: Compute metrics
attention_index = P_alpha / P_theta   // α/θ ratio
relative_alpha  = P_alpha / P_total   // relative α power

// Sleep staging example:
// Awake (eyes closed): α dominant (relative_alpha > 0.4)
// N1 light sleep: θ increases, α decreases
// N2: Sleep spindles (12-14 Hz bursts) + K-complexes
// N3 deep sleep: δ dominant (relative_delta > 0.5, amplitude > 75 μV)
// REM: Low-amplitude mixed frequency (similar to awake, but with rapid eye movements)
HRV Analysis Steps
// Step 1: R-wave detection (Pan-Tompkins Algorithm)
ecg_filtered = bandpass(ecg, 5, 15) // Focus on QRS complex
ecg_diff = differentiate(ecg_filtered)
ecg_squared = ecg_diff ** 2
ecg_integrated = moving_average(ecg_squared, 150ms)
R_peaks = adaptive_threshold(ecg_integrated)
// Step 2: Compute R-R intervals (Tachogram)
RR_intervals = diff(R_peaks) / fs // units: seconds
RR_times = cumsum(RR_intervals) // corresponding time points
// Step 3: Remove ectopic beats
for i in range(1, len(RR_intervals)):
    if abs(RR_intervals[i] - RR_intervals[i-1]) / RR_intervals[i-1] > 0.20:
        RR_intervals[i] = interpolate(neighbors)  // replace ectopic beat by interpolation
// Step 4: Resample to uniform spacing
fs_resample = 4 // 4 Hz
RR_resampled = cubic_spline_interpolate(RR_times, RR_intervals, fs_resample)
RR_resampled -= mean(RR_resampled) // remove DC
// Step 5: Welch PSD
segment_length = 256 // 256 points @4Hz = 64 seconds
overlap = 0.5
PSD_hrv = welch(RR_resampled, segment_length, overlap, 'hann')
// Step 6: Band power
LF_power = integrate(PSD_hrv, 0.04, 0.15) // ms²
HF_power = integrate(PSD_hrv, 0.15, 0.40) // ms²
LF_HF_ratio = LF_power / HF_power
// Normal reference values:
// Healthy young adults: LF/HF ≈ 1.0-2.0
// Stress/anxiety: LF/HF ↑ (sympathetic activation)
// Athletes at rest: HF ↑, LF/HF ↓ (parasympathetic dominant)
// Heart failure: Total power ↓↓, LF/HF may be abnormally high or low
Application Scenarios
- Brain-Computer Interface (BCI): Detects μ (8-12Hz) / β (18-26Hz) rhythm changes caused by motor imagery, controlling wheelchairs or robotic arms. Classification accuracy 70-90%.
- Sleep monitoring wearables: Using single-channel frontal EEG (e.g., Muse headband), computing δ/θ/α power for automatic sleep staging. Sample rate 256Hz, Welch PSD every 30-second epoch.
- Cardiac autonomic assessment: 5-minute short-term HRV analysis for stress assessment, athletic training monitoring, and diabetic autonomic neuropathy screening. Apple Watch/Garmin watches have this feature built in.
Pitfalls and Limitations
- EEG artifacts dominate the spectrum: Eye movements (EOG) produce large-amplitude 0-4Hz interference → misidentified as δ activity. Muscle activity (EMG) contaminates >20Hz. 50/60Hz power line interference. Artifacts must be removed first.
- LF ≠ sympathetic activity: This is the most common misconception! The LF band is modulated by both sympathetic and parasympathetic nervous systems (baroreflex mechanism). Only HF more purely reflects parasympathetic activity. The LF/HF ratio is only a rough indicator of "sympathovagal balance."
- Short-segment HRV is unreliable: VLF requires at least 5 minutes of data (otherwise insufficient frequency resolution). LF requires at least 2 minutes. 24-hour long-term analysis is more stable.
- Breathing rate affects HF: If the subject breathes very slowly (<0.15Hz, e.g., during meditation), the respiratory component falls in the LF band instead of HF → LF/HF ratio is distorted. Breathing rate must be simultaneously recorded.
- Transient events in EEG (e.g., epileptic spikes, lasting <200ms) → frequency-domain methods cannot localize in time. Use time-frequency analysis (STFT, wavelet).
- HRV during cardiac arrhythmia (e.g., atrial fibrillation) → RR intervals are completely irregular, frequency-domain analysis is meaningless. Use nonlinear methods (approximate entropy, Poincare plot).
- Alternative: Multiscale Entropy analysis, Recurrence Plot, deep learning automatic feature extraction.
✅ Quick Check
Q1: Which EEG frequency band is enhanced when the eyes are closed?
Show answer
Alpha waves (8-13 Hz) — this is the Berger effect.
Q2: What does the LF/HF ratio in HRV frequency-domain analysis represent?
Show answer
Often interpreted as a sympathetic/parasympathetic balance indicator, but this is actually an oversimplification — LF is modulated by both sympathetic and parasympathetic activity.
Interactive: EEG Sleep Staging Exercise
The system generates a simulated EEG power spectrum. Determine the sleep stage based on the power distribution across frequency bands.
6.10 Complete Vibration Analysis Workflow
Learning Objectives
- Describe the complete 6-step workflow from sensor to fault diagnosis in vibration analysis
- Compute BPFO/BPFI characteristic frequencies for given bearing parameters
- Distinguish spectral signatures of imbalance (1X), misalignment (2X), looseness (multi-harmonic), and bearing faults (BPFO)
Why does this matter? Because Predictive Maintenance saves industry billions of dollars annually, and its core is vibration spectrum analysis.
Previously: 6.9 analyzed human body signals. This final chapter ties all the tools together, demonstrating the complete end-to-end vibration analysis workflow from sensor to fault diagnosis.
- ISO 10816 / ISO 20816: Defines vibration severity levels (A: Good → D: Dangerous), the international standard for industrial vibration monitoring.
- Robert Randall and Jérôme Antoni: Systematized the Envelope Spectrum Analysis method, particularly the band selection strategy (Spectral Kurtosis) for bearing fault diagnosis.
Complete Vibration Analysis Workflow
Step 1: Signal Acquisition
- Sensor: Accelerometer, typically IEPE/ICP type (built-in amplifier, single coaxial cable output).
- Sample rate: At least 2.56× the highest frequency of interest (ISO recommends including anti-aliasing filter roll-off).
- General rotating machinery: fs = 25.6 kHz (covering up to 10 kHz)
- Bearing diagnostics: fs = 51.2 kHz (high-frequency resonance band needed)
- Gearbox analysis: fs > 50 kHz (mesh frequency can be very high)
- Anti-Aliasing Filter: Analog low-pass filter before the ADC, cutoff frequency set at 0.4×fs.
- Mounting: Stud mount > magnet > adhesive > handheld probe. Mounting quality directly affects high-frequency response.
Step 2: Preprocessing
// DC offset removal
signal -= mean(signal)

// High-pass filter: remove < 5 Hz low-frequency interference
// (inertial base vibration, loose sensor mounting, temperature drift)
signal = highpass(signal, fc=5, order=4)

// Bandpass filter when needed
// (focus on frequency band of interest, e.g., around gear mesh frequency)
signal_bp = bandpass(signal, f_low, f_high, order=6)
Step 3: Basic Spectrum Analysis
// Windowed FFT + Welch averaging
nfft = 8192        // Frequency resolution = fs/nfft = 25600/8192 = 3.125 Hz
window = 'hann'
n_averages = 16    // 16-segment averaging reduces random variance
overlap = 0.5      // 50% overlap
PSD = welch(signal, nfft, overlap, window, n_averages)
Identifying Characteristic Frequencies:
| Fault Type | Characteristic Frequency | Spectral Signature |
|---|---|---|
| Imbalance | 1X = shaft speed | High 1X amplitude, primarily radial |
| Misalignment | 2X = 2×shaft speed | High 2X amplitude, may include axial vibration |
| Mechanical Looseness | Multiple harmonics (1X, 2X, 3X...) | Multiple shaft speed harmonics, possibly 0.5X sub-harmonic |
| Gear Fault | Mesh frequency = tooth count×speed | Sidebands ±1X around mesh frequency |
| Blade/Vane Issues | Blade pass frequency = blade count×speed | Fluid pulsation in pumps/fans |
| Bearing Fault | BPFO/BPFI/BSF/FTF | Requires envelope spectrum analysis (see Step 4) |
Step 4: Envelope Spectrum Analysis (Bearing Fault Diagnosis)
Intuition: Each time a bearing defect passes through the load zone, it produces a brief impact that excites the high-frequency structural resonance of the bearing housing (2-10 kHz). These impacts repeat at the bearing fault frequency. Directly viewing the spectrum only shows high-frequency resonance, not the periodicity of the fault frequency. Envelope analysis = bandpass extract the resonance band → Hilbert envelope → FFT of the envelope → look for fault frequencies in the envelope spectrum.
// Step 4a: Bandpass filter (focus on high-frequency resonance band)
// Band selection: use Spectral Kurtosis (SK) to automatically find the optimal band,
// or empirically select 2-10 kHz
signal_bp = bandpass(signal, 2000, 10000, order=6)

// Step 4b: Hilbert envelope
analytic = hilbert(signal_bp)
envelope = abs(analytic)

// Step 4c: Envelope FFT
envelope -= mean(envelope)   // remove DC
envelope_spectrum = fft(envelope * hann_window)

// Step 4d: Search for bearing characteristic frequencies in the envelope spectrum
// Peaks at BPFO and its harmonics (2×BPFO, 3×BPFO...) → outer race fault
// Peaks at BPFI and its harmonics, modulated by 1X → inner race fault
// Peaks at BSF and its harmonics → rolling element fault
// Peaks at FTF and its harmonics → cage fault
Step 5: Time-Frequency Analysis (Variable Speed Conditions)
If the rotational speed is not constant (run-up, coast-down, load changes), characteristic frequencies change over time. STFT or Order Tracking is needed.
- STFT: Observe frequency changes over time (colorful waterfall plot).
- Order Tracking: Synchronous sampling using a tachometer, converting the time axis to "angle" → characteristic frequencies become fixed "orders."
Step 6: Trend Monitoring
// Acquire vibration data daily (or weekly)
// Track trends of key metrics over time:
// - Overall RMS vibration level (mm/s)
// - RMS in specific frequency bands (e.g., 1X, BPFO band)
// - Peak and Crest Factor

// Alarm threshold settings:
// Method 1: ISO 20816 absolute thresholds
//   Zone A (< 2.8 mm/s): Good
//   Zone B (2.8-7.1 mm/s): Acceptable
//   Zone C (7.1-18 mm/s): Requires attention
//   Zone D (> 18 mm/s): Dangerous, immediate shutdown
// Method 2: Baseline relative thresholds
//   Warning: Baseline + 6 dB (2x)
//   Alarm: Baseline + 12 dB (4x)

// Sudden trend increase → schedule maintenance (planned shutdown)
// 10-100x cheaper than unplanned downtime
Bearing Characteristic Frequencies
$$\text{BPFO} = \frac{n}{2} f_r \left(1 - \frac{d}{D}\cos\alpha\right) \quad \text{(outer race fault frequency)}$$
$$\text{BPFI} = \frac{n}{2} f_r \left(1 + \frac{d}{D}\cos\alpha\right) \quad \text{(inner race fault frequency)}$$
$$\text{BSF} = \frac{D}{2d} f_r \left[1 - \left(\frac{d}{D}\cos\alpha\right)^2\right] \quad \text{(ball spin frequency)}$$
$$\text{FTF} = \frac{1}{2} f_r \left(1 - \frac{d}{D}\cos\alpha\right) \quad \text{(cage fault frequency)}$$
where: n = number of rolling elements, fr = shaft speed (Hz), d = rolling element diameter, D = pitch diameter, α = contact angle.
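The four formulas translate directly into a small helper function (`bearing_freqs` is a hypothetical name; the formulas are the ones above):

```python
import math

def bearing_freqs(n, d, D, fr, alpha_deg=0.0):
    """Bearing characteristic frequencies in Hz.

    n: number of rolling elements, d: rolling element diameter,
    D: pitch diameter (same units as d), fr: shaft speed in Hz,
    alpha_deg: contact angle in degrees.
    """
    r = (d / D) * math.cos(math.radians(alpha_deg))
    return {
        "BPFO": (n / 2) * fr * (1 - r),
        "BPFI": (n / 2) * fr * (1 + r),
        "BSF": (D / (2 * d)) * fr * (1 - r * r),
        "FTF": (fr / 2) * (1 - r),
    }
```

For the SKF 6205 example that follows, `bearing_freqs(9, 7.94, 38.5, 30.0)` reproduces BPFO ≈ 107.2 Hz, BPFI ≈ 162.8 Hz, BSF ≈ 69.6 Hz, and FTF ≈ 11.9 Hz.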
Concrete Example: SKF 6205 Bearing
// SKF 6205 deep groove ball bearing parameters
n = 9 // number of balls
d = 7.94 mm // ball diameter
D = 38.5 mm // pitch diameter
α = 0° // contact angle (deep groove ball bearing)
cos(α) = 1
// Shaft speed: 1800 RPM = 30 Hz
f_r = 30 Hz
// BPFO (outer race fault frequency)
BPFO = (9/2) × 30 × (1 - 7.94/38.5)
     = 4.5 × 30 × 0.7938 = 107.2 Hz
// BPFI (inner race fault frequency)
BPFI = (9/2) × 30 × (1 + 7.94/38.5)
     = 4.5 × 30 × 1.2062 = 162.8 Hz
// BSF (ball spin frequency)
BSF = (38.5 / (2×7.94)) × 30 × [1 - (7.94/38.5)²]
= 2.424 × 30 × [1 - 0.04253]
= 2.424 × 30 × 0.9575 = 69.6 Hz
// FTF (cage fault frequency)
FTF = (1/2) × 30 × (1 - 7.94/38.5)
    = 15 × 0.7938 = 11.9 Hz
// Diagnostic logic:
// Envelope spectrum peaks at 107.2, 214.4, 321.6 Hz → outer race fault
// Envelope spectrum peaks at 162.8, 325.6 Hz with ±30Hz sidebands → inner race fault
// Envelope spectrum peaks at 69.6, 139.2 Hz → ball fault
// Envelope spectrum peaks at 11.9, 23.8 Hz → cage fault
Application Scenarios
- Petrochemical plant pump monitoring: Centrifugal pumps are the most numerous rotating machines in petrochemical plants. Vibration monitoring provides early warning of imbalance, misalignment, and bearing wear. Each pump has 2-3 accelerometers installed (horizontal, vertical, axial), with online monitoring systems acquiring data every second.
- Wind turbine main bearing: A single main bearing can cost millions of dollars, and replacement requires a large crane (even more costly). Vibration + AE (Acoustic Emission) monitoring detects early damage, providing 3-6 months advance warning.
- CNC spindle: Bearing faults in high-speed spindles (30,000-60,000 RPM) directly affect machining precision. Vibration monitoring is used for cutting force monitoring, tool wear detection, and spindle health management.
Pitfalls and Limitations
- Poor sensor mounting: Handheld probes have high-frequency response only to 1-2 kHz, magnets to 3-5 kHz, stud mounts to 10-20 kHz. Envelope analysis requires high frequencies → mounting quality is critical.
- Unstable speed: Speed variations cause spectral peaks to smear, reducing frequency resolution. Variable frequency drive motors are especially problematic. Solution: speed-synchronized sampling (order tracking).
- Looking only at overall RMS without the spectrum: Overall RMS indicates "severity" but cannot distinguish fault types. High 1X may be imbalance, high 2X may be misalignment — the corrective actions are completely different.
- False positive: seeing BPFO does not necessarily mean bearing failure: Confirm whether modulation is present (envelope spectrum shows BPFO harmonics + modulated by shaft speed). Some structural resonance frequencies may coincidentally be near BPFO, causing misdiagnosis.
- Very low-speed machines (<10 RPM) → acceleration signal is too weak. Use proximity probes or AE instead.
- Non-rotating machines (e.g., pressure vessels, piping) → no clear rotational frequency. Use AE or guided wave.
- Transient events (e.g., gear tooth breakage) → steady-state FFT cannot capture these. Use time-frequency analysis or statistical indicators (kurtosis, crest factor).
- Alternative: Machine learning (ML) automatic feature extraction is replacing some traditional rule-based diagnostics. However, ML model training still requires FFT features as input — the two are complementary, not replacements.
📝 Worked Example
Pump speed 1500RPM, gear tooth count Z=23. (a) 1X frequency? (b) Mesh frequency? (c) 2-second measurement, fs=25.6kHz, FFT with 4096-point Hann window, Δf? (d) Can you resolve the ±fr modulation sidebands near 1X?
Show solution
(a) fr = 1500/60 = 25Hz
(b) GMF = 23×25 = 575Hz
(c) Δf = 25600/4096 = 6.25Hz
(d) Sidebands at 575±25Hz = 550Hz and 600Hz, spacing 25Hz > Δf=6.25Hz → yes, resolvable
✅ Quick Check
Q1: What is the BPFO of a SKF 6205 bearing (9 balls, d=7.94mm, D=38.5mm) at 1800 RPM?
Show answer
fr=30Hz, BPFO = (9/2)×30×(1 - 7.94/38.5) ≈ 107.2 Hz.
Q2: Why can't vibration analysis rely solely on overall RMS?
Show answer
Because RMS only reflects total energy and cannot distinguish fault types. For example, imbalance (1X) and bearing defects (BPFO) may have similar RMS values, but their spectra are completely different.
Interactive: Vibration Fault Diagnosis Exercise
The system randomly generates a vibration spectrum. Determine the fault type based on spectral characteristics.
2b.1 Discrete-Time Signals
Stepping from the continuous world into the discrete world — the starting point of digital signal processing
One-Sentence Summary: A discrete-time signal is a sequence $x[n]$ defined only at integer time indices $n$; these sequences are the fundamental "atoms" on which all DSP operations are built.
Learning Objectives
- Identify the five fundamental sequences: unit impulse $\delta[n]$, unit step $u[n]$, exponential sequence, sinusoidal sequence, and complex exponential sequence
- Classify signals as Energy Signals or Power Signals
- Perform sequence operations: time shift, folding, and amplitude scaling
- Understand the periodicity condition for discrete sinusoids: $f_0/f_s$ must be rational
- Decompose any sequence into its even and odd components (Even/Odd Decomposition)
Why Learn This: Every DSP algorithm — filters, FFT, modulation/demodulation — ultimately operates on discrete-time sequences. If you do not understand the sifting property of $\delta[n]$, or do not realize that a discrete sinusoid is "not necessarily periodic," you will hit roadblocks when studying the DFT and Z-transform later. This chapter is your "alphabet."
Previously: In the previous section (2a.2 CTFT) we dealt with continuous-time signals $x(t)$ and their spectra $X(f)$. Now we "sample" — keeping only the values at integer time instants — and enter the discrete-time world. The continuous-time Dirac delta $\delta(t)$ becomes the Kronecker delta $\delta[n]$, and integrals become summations.
Pain Point: What Exactly Is Different Between Continuous and Discrete?
When jumping from continuous time to discrete time, many "intuitions" break down:
- $\cos(\omega_0 n)$ is not necessarily periodic! The continuous version $\cos(\omega_0 t)$ always has period $2\pi/\omega_0$, but the discrete version is periodic only when $\omega_0/(2\pi)$ is rational.
- Frequency has an upper limit: The frequency of a discrete signal is meaningful only in $[0, \pi]$ (or $[0, f_s/2]$); $\omega = \pi$ is the Nyquist frequency.
- Exponential sequences can blow up: $\alpha^n u[n]$ diverges when $|\alpha|>1$, giving infinite energy — the continuous world has an analog, but numerical issues arise more easily in the discrete version.
Historical Context: The formalization of discrete-time signals began in the 1940s–50s. Claude Shannon's (1916–2001) sampling theorem (1949) built the bridge between continuous and discrete worlds, and the Z-transform introduced by Ragazzini and Zadeh (1952) brought discrete system analysis to maturity. What truly popularized DSP was the 1965 Cooley–Tukey FFT algorithm, which moved frequency-domain analysis of discrete sequences from theory to practice.
Core Concepts: The Five Fundamental Sequences
Intuition First: Just as chemistry has the periodic table, DSP has its "elements" — these five fundamental sequences. Any discrete signal can be expressed as a linear combination of $\delta[n]$ (this is the foundation of convolution).
| Sequence | Definition | Properties |
|---|---|---|
| Unit Impulse $\delta[n]$ | $\delta[n] = \begin{cases}1, & n=0\\0, & n\neq 0\end{cases}$ | Sifting property: $x[n]\cdot\delta[n-k] = x[k]\cdot\delta[n-k]$ |
| Unit Step $u[n]$ | $u[n] = \begin{cases}1, & n\geq 0\\0, & n < 0\end{cases}$ | $u[n] = \sum_{k=0}^{\infty}\delta[n-k]$ |
| Exponential Sequence | $x[n] = \alpha^n u[n]$ | $|\alpha|<1$: decaying; $|\alpha|>1$: growing |
| Sinusoidal Sequence | $x[n] = A\cos(\omega_0 n + \phi)$ | Period $N$ exists $\iff \omega_0/(2\pi) \in \mathbb{Q}$ |
| Complex Exponential Sequence | $x[n] = e^{j\omega_0 n}$ | $e^{j(\omega_0+2\pi)n} = e^{j\omega_0 n}$ ($2\pi$ periodicity) |
Key Relationship
Decomposition of any sequence via $\delta[n]$:
$$x[n] = \sum_{k=-\infty}^{\infty} x[k]\,\delta[n-k]$$

This is the prototype of convolution $x[n] * \delta[n] = x[n]$, and the starting point for LTI system theory in the next chapter.
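This decomposition is easy to verify numerically (a small NumPy check; the example sequence is an arbitrary choice):

```python
import numpy as np

n = np.arange(-5, 6)
delta = lambda m: (m == 0).astype(float)        # Kronecker delta on an index grid

x = np.where(n >= 0, 0.5 ** n, 0.0)             # example: 0.5^n u[n]
# x[n] = sum_k x[k] delta[n - k]: rebuild x from weighted, shifted impulses
x_rec = sum(x[k] * delta(n - n[k]) for k in range(n.size))
```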
Energy Signal vs. Power Signal
Energy:
$$E = \sum_{n=-\infty}^{\infty} |x[n]|^2$$

Average Power:

$$P = \lim_{N\to\infty} \frac{1}{2N+1}\sum_{n=-N}^{N} |x[n]|^2$$

| Type | Condition | Example |
|---|---|---|
| Energy Signal | $0 < E < \infty$, $P = 0$ | $\alpha^n u[n]$ ($|\alpha|<1$), $\delta[n]$ |
| Power Signal | $E = \infty$, $0 < P < \infty$ | $\cos(\omega_0 n)$, $u[n]$ |
| Neither | $E = \infty$, $P = \infty$ | $2^n u[n]$ (exponential growth) |
Expand: Energy computation for $\alpha^n u[n]$
Let $x[n] = \alpha^n u[n]$ with $|\alpha| < 1$:
$$E = \sum_{n=0}^{\infty} |\alpha^n|^2 = \sum_{n=0}^{\infty} |\alpha|^{2n} = \frac{1}{1-|\alpha|^2}$$

Since $|\alpha|^2 < 1$, the geometric series converges. For example, $\alpha = 0.8$: $E = 1/(1-0.64) = 2.778$. $\;\blacksquare$
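The closed form can be checked numerically (a quick sketch; truncating the sum at n = 200 is an assumption that the tail is negligible, which it clearly is for $|\alpha| = 0.8$):

```python
import numpy as np

alpha = 0.8
n = np.arange(200)                       # 0.64^n is negligible long before n = 200
E_numeric = np.sum(np.abs(alpha ** n) ** 2)
E_closed = 1 / (1 - alpha ** 2)          # geometric-series result: 1/(1-0.64) = 2.778
```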
Sequence Operations: Time Shift, Folding, and Scaling
Time Shift: $y[n] = x[n-n_0]$
- $n_0 > 0$: delay (shift right) by $n_0$ samples
- $n_0 < 0$: advance (shift left) by $|n_0|$ samples
Folding (Time Reversal): $y[n] = x[-n]$, mirror-reflected about $n=0$
Amplitude Scaling: $y[n] = c \cdot x[n]$
Caution: In discrete time there is no simple analog of "time compression $x[2n]$"! $x[2n]$ means downsampling, which causes information loss (aliasing). This is fundamentally different from $x(2t)$ in continuous time.
Periodicity of Discrete Sinusoids
$x[n] = \cos(\omega_0 n)$ has period $N$ ($x[n+N]=x[n]$) if and only if:

$$\omega_0 N = 2\pi m \quad \text{for some positive integers } N, m$$

That is, the normalized frequency $\omega_0/(2\pi) = m/N$ must be a rational number. Writing $\omega_0/(2\pi) = m/N$ in lowest terms, the denominator $N$ is the minimum period.
Examples:
- $\cos(0.3\pi n)$: $0.3\pi/(2\pi) = 0.15 = 3/20$ is rational → period $N=20$
- $\cos(n)$: $1/(2\pi) \approx 0.1592...$ is irrational → aperiodic
- $\cos(\pi n) = (-1)^n$: $\pi/(2\pi) = 1/2$ → period $N=2$
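The rationality test maps neatly onto Python's `Fraction` (`min_period` is a hypothetical helper; it applies only when the normalized frequency is rational):

```python
from fractions import Fraction

def min_period(f0):
    """Minimum period N of cos(2*pi*f0*n), where f0 = omega0/(2*pi) is rational.

    In lowest terms f0 = m/N, and the denominator N is the minimum period.
    """
    return Fraction(f0).limit_denominator(10**6).denominator
```

`min_period(Fraction(3, 20))` returns 20, matching the $\cos(0.3\pi n)$ example; for $\cos(n)$, $f_0 = 1/(2\pi)$ is irrational, so no finite period exists and the helper does not apply.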
Even/Odd Decomposition
Any real-valued sequence $x[n]$ can be uniquely decomposed as $x[n] = x_e[n] + x_o[n]$, where:

$$x_e[n] = \tfrac{1}{2}\bigl(x[n] + x[-n]\bigr), \qquad x_o[n] = \tfrac{1}{2}\bigl(x[n] - x[-n]\bigr)$$
Properties: $x_e[-n] = x_e[n]$ (even symmetry), $x_o[-n] = -x_o[n]$ (odd symmetry), and $x_o[0] = 0$.
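Numerically, on a symmetric index grid, folding $x[-n]$ is just array reversal. A minimal check using the standard formulas $x_e = \frac{1}{2}(x[n]+x[-n])$ and $x_o = \frac{1}{2}(x[n]-x[-n])$ (the example sequence is arbitrary):

```python
import numpy as np

n = np.arange(-4, 5)                     # symmetric grid, so x[-n] is a reversal
x = np.where(n >= 0, 0.9 ** n, 0.0)      # example: 0.9^n u[n]
x_fold = x[::-1]                         # x[-n]
xe = 0.5 * (x + x_fold)                  # even part
xo = 0.5 * (x - x_fold)                  # odd part
```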
How to Use: Worked Examples
Worked Example: Determine whether $x[n] = (0.9)^n u[n]$ is an energy signal or a power signal
- Compute the energy: $E = \sum_{n=0}^{\infty} (0.9)^{2n} = \sum_{n=0}^{\infty} (0.81)^n = \frac{1}{1-0.81} = 5.263$
- $E < \infty$, so it is an energy signal
- Average power $P = 0$ (the power of an energy signal is always zero)
Worked Example: Determine the period of $\cos(0.4\pi n)$
- Normalized frequency $f_0 = 0.4\pi/(2\pi) = 0.2 = 1/5$
- $1/5$ is rational, so the sequence is periodic
- Minimum period $N = 5$ ($\cos(0.4\pi(n+5)) = \cos(0.4\pi n + 2\pi) = \cos(0.4\pi n)$)
Applications
- Audio Coding: A CD sampling rate of $f_s=44.1$ kHz produces 44,100 discrete samples per second. Understanding the energy characteristics of sequences determines the bit-allocation strategy for quantization.
- Radar Pulse Compression: The transmitter generates a complex-exponential chirp sequence; the receiver performs matched filtering (essentially convolution) to detect targets.
- Communication Synchronization: The sifting property of $\delta[n]$ is used to design pilot sequences, and cross-correlation is used to estimate timing offsets.
Pitfalls and Limitations
- Discrete-time $\neq$ digital: The amplitude of $x[n]$ is still a continuous value (real number). A truly digital signal also requires quantization, which introduces quantization noise.
- $x[2n]$ is not "speed-up playback": In discrete time, $x[2n]$ means downsampling, which destroys spectral content (aliasing). Do not draw an analogy with the continuous-time $x(2t)$.
- $\cos(n)$ has no period: Many beginners assume all sinusoids are periodic. A discrete sinusoid is periodic only when $\omega_0/(2\pi)$ is rational.
Quick Check
Q1: Is $x[n] = 5\cos(0.6\pi n)$ a periodic sequence? If so, what is the minimum period?
Show answer
$\omega_0/(2\pi) = 0.6\pi/(2\pi) = 0.3 = 3/10$, which is rational, so it is a periodic sequence. We need $0.6\pi \cdot N = 2\pi m$, i.e., $N = 10m/3$. The smallest positive integer $N$: set $m=3$, giving $N=10$. The minimum period is $\mathbf{10}$.
Q2: Is $x[n] = (-1)^n$ an energy signal or a power signal?
Show answer
$|x[n]|^2 = 1$ for all $n$, so $E = \sum_{-\infty}^{\infty} 1 = \infty$. $P = \lim_{N\to\infty}\frac{1}{2N+1}\sum_{-N}^{N} 1 = 1$. Since $0 < P < \infty$, it is a power signal. (Note that $(-1)^n = \cos(\pi n)$, a sinusoidal sequence.)
Interactive: Fundamental Discrete-Time Sequences
Select a signal type and adjust the parameters to observe the stem plot of the discrete-time sequence.
References: [1] Oppenheim & Schafer, Discrete-Time Signal Processing, 3rd ed., Ch.2. [2] Proakis & Manolakis, Digital Signal Processing, 4th ed., Ch.2. [3] Haykin & Van Veen, Signals and Systems, Ch.6.
2b.2 LTI Systems & Convolution
A single impulse response fully describes an entire system — the most elegant result in DSP
One-Sentence Summary: If a system is Linear and Time-Invariant (LTI), then knowing only its response to $\delta[n]$ — the impulse response $h[n]$ — is enough to compute the output for any input via convolution $y[n] = x[n] * h[n]$.
Learning Objectives
- Define Linearity and Time-Invariance, and determine whether a given system satisfies them
- Derive the Convolution Sum $y[n] = \sum_k x[k]\,h[n-k]$
- Understand how the impulse response $h[n]$ completely characterizes an LTI system
- Apply the commutative, associative, and distributive properties of convolution
- Establish criteria for Causality and BIBO Stability
Why Learn This: Convolution is the most central operation in DSP. Every time you use an FIR filter, create a reverb effect, or compute cross-correlations for radar target detection, convolution is at work. Understanding LTI theory tells you: Why is a single impulse-response measurement sufficient? Why is the frequency response the DTFT of $h[n]$? Why does cascading two filters equal convolving their impulse responses?
Previously: In the previous section we learned to represent any signal as $x[n] = \sum_k x[k]\,\delta[n-k]$ — a weighted sum of delayed impulses. The key question now is: if we know the system's response to $\delta[n]$, can we compute its response to $x[n]$? The answer is yes, provided the system is LTI.
Pain Point: Why Can't We Test Every Input Individually?
Imagine you have designed a filter and want to know how it behaves for every possible input:
- The input space is infinite-dimensional — you cannot exhaustively test all $x[n]$
- If the system is nonlinear (e.g., $y[n] = x[n]^2$), knowing the response to $\delta[n]$ is entirely insufficient
- The LTI assumption is the "master key": one test $\to$ complete characterization
Historical Context: The concept of convolution traces back to the integral operations of Euler and Laplace in the 18th century. In the 1930s, Norbert Wiener used convolution extensively in the theory of stochastic processes. Discrete convolution became a standard signal processing tool in the 1960s with the rise of digital computers. The invention of the FFT in 1965 made "fast convolution" possible — replacing $O(N^2)$ direct computation with $O(N\log N)$ FFT-based convolution.
Core Concepts: What Is an LTI System?
Intuition First: Think of a black box $\mathcal{T}\{\cdot\}$ that takes in a sequence and produces a sequence.
Linearity = Homogeneity + Additivity (Superposition):
$$\mathcal{T}\{a\,x_1[n] + b\,x_2[n]\} = a\,\mathcal{T}\{x_1[n]\} + b\,\mathcal{T}\{x_2[n]\}$$Time-Invariance:
$$\text{If } \mathcal{T}\{x[n]\} = y[n], \text{ then } \mathcal{T}\{x[n-n_0]\} = y[n-n_0]$$The system's response to a delayed input equals the delayed output.
How to Determine: Common Examples
| System | Linear? | Time-Invariant? | LTI? |
|---|---|---|---|
| $y[n] = 3x[n] + 2x[n-1]$ | Yes | Yes | Yes |
| $y[n] = x[n]^2$ | No | Yes | No |
| $y[n] = n \cdot x[n]$ | Yes | No | No |
| $y[n] = x[n] + 1$ | No | Yes | No |
| $y[n] = x[-n]$ | Yes | No | No |
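The table's verdicts can be spot-checked numerically. A minimal sketch (the helper `is_linear` and the two illustrative systems `fir` and `sq` are not from the text; they correspond to rows 1 and 2 of the table):

```python
import numpy as np

def is_linear(T, n=32, seed=0):
    """Numerically test superposition: T{a*x1 + b*x2} == a*T{x1} + b*T{x2}."""
    rng = np.random.default_rng(seed)
    x1, x2 = rng.standard_normal(n), rng.standard_normal(n)
    a, b = 2.0, -3.0
    return np.allclose(T(a * x1 + b * x2), a * T(x1) + b * T(x2))

def fir(x):   # y[n] = 3 x[n] + 2 x[n-1]  (row 1: linear)
    return 3 * x + 2 * np.concatenate(([0.0], x[:-1]))

def sq(x):    # y[n] = x[n]^2             (row 2: nonlinear)
    return x ** 2

print(is_linear(fir), is_linear(sq))   # → True False
```

A random-input test like this can only falsify linearity, never prove it, but it catches nonlinear systems such as $y[n]=x[n]^2$ immediately.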
Derivation of the Convolution Sum
Intuition: The input $x[n]$ is a weighted sum of delayed impulses. Linearity + time-invariance of an LTI system means each impulse's response is also weighted, delayed, and superposed.
Expand: Full derivation of the convolution sum
Step 1: Decompose any signal $x[n]$ as a weighted sum of delayed impulses:
$$x[n] = \sum_{k=-\infty}^{\infty} x[k]\,\delta[n-k]$$Step 2: Define the impulse response $h[n] = \mathcal{T}\{\delta[n]\}$. By time-invariance:
$$\mathcal{T}\{\delta[n-k]\} = h[n-k]$$Step 3: By linearity (superposition):
$$y[n] = \mathcal{T}\{x[n]\} = \mathcal{T}\left\{\sum_{k} x[k]\,\delta[n-k]\right\} = \sum_{k} x[k]\,\mathcal{T}\{\delta[n-k]\} = \sum_{k} x[k]\,h[n-k]$$Conclusion:
$$\boxed{y[n] = \sum_{k=-\infty}^{\infty} x[k]\,h[n-k] \;\equiv\; x[n] * h[n]}$$$\;\blacksquare$
Convolution Sum
$$y[n] = x[n] * h[n] = \sum_{k=-\infty}^{\infty} x[k]\,h[n-k]$$"Flip, slide, multiply, sum" — these four steps are the recipe for hand-computing convolution.
Properties of Convolution
| Property | Formula | Engineering Significance |
|---|---|---|
| Commutative | $x * h = h * x$ | Input and impulse response roles are interchangeable |
| Associative | $(x * h_1) * h_2 = x * (h_1 * h_2)$ | Cascaded filters = convolution of impulse responses |
| Distributive | $x * (h_1 + h_2) = x*h_1 + x*h_2$ | Parallel filters = sum of impulse responses |
| Identity Element | $x * \delta = x$ | $\delta[n]$ is the "1" of convolution |
Causality and BIBO Stability
Causal System: The output depends only on the present and past inputs.
$$\text{LTI Causal} \iff h[n] = 0 \text{ for } n < 0$$BIBO Stability (Bounded-Input Bounded-Output): Bounded input → bounded output.
$$\text{LTI BIBO Stable} \iff \sum_{n=-\infty}^{\infty} |h[n]| < \infty$$The impulse response must be absolutely summable.
Expand: Proof of BIBO stability
Sufficiency ($\Leftarrow$): If $\sum|h[n]| < \infty$ and $|x[n]| \leq B_x$, then:
$$|y[n]| = \left|\sum_k x[k]\,h[n-k]\right| \leq \sum_k |x[k]|\,|h[n-k]| \leq B_x \sum_k |h[k]| < \infty$$Necessity ($\Rightarrow$): If $\sum|h[n]| = \infty$, construct the bounded input $x[n] = \text{sgn}(h[-n])$. Then $|x[n]| \leq 1$ but $y[0] = \sum_k |h[k]| = \infty$, so the output is unbounded. $\;\blacksquare$
How to Use: Hand-Computing Convolution
Compute the convolution of $x[n] = \{1, 2, 3\}$ ($n=0,1,2$) and $h[n] = \{1, 1, 1\}/3$ (3-point moving average)
- Flip $h[k]$: $h[-k] = \{1, 1, 1\}/3$ (symmetric, so flipping has no effect)
- Slide $h[n-k]$ and compute the product-sum for each $n$:
Result: $y[n] = \{1/3, 1, 2, 5/3, 1\}$, length = $3 + 3 - 1 = 5$.
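The hand computation can be verified with NumPy's `convolve`, which implements exactly this "flip, slide, multiply, sum" recipe:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
h = np.ones(3) / 3                 # 3-point moving average
y = np.convolve(x, h)              # full linear convolution, length 3+3-1 = 5
print(y)                           # → [1/3, 1, 2, 5/3, 1]
```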
Applications
- FIR Digital Filters: Every FIR filter is essentially "convolving the input with the filter coefficients." The filter coefficients are the impulse response $h[n]$.
- Audio Reverb: Record a room's response $h[n]$ to a hand clap (approximating $\delta[n]$), then convolve it with any dry signal to simulate playing in that room.
- Communication Channel Modeling: A wireless channel can be modeled by a multipath impulse response $h[n]$; the received signal = transmitted signal $*$ channel impulse response + noise.
Pitfalls and Limitations
- Convolution applies only to LTI systems: If the system is nonlinear (e.g., a compressor) or time-varying (e.g., LFO modulation), convolution does not apply.
- Direct convolution costs $O(NM)$: For long sequences, always use FFT-based fast convolution ($O(N\log N)$).
- Linear convolution vs. circular convolution: The DFT computes circular convolution! To perform linear convolution via the DFT, you must zero-pad. This is the most common mistake beginners make.
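The last two pitfalls combine into one recipe: to get *linear* convolution from the DFT, zero-pad both sequences to at least $N+M-1$ points. A minimal sketch (the helper name `fft_convolve` is illustrative):

```python
import numpy as np

def fft_convolve(x, h):
    """Linear convolution via the DFT: zero-pad to >= len(x)+len(h)-1,
    otherwise the result is circular (time-aliased)."""
    n = len(x) + len(h) - 1
    nfft = 1 << (n - 1).bit_length()          # next power of two >= n
    Y = np.fft.rfft(x, nfft) * np.fft.rfft(h, nfft)
    return np.fft.irfft(Y, nfft)[:n]

x = np.random.default_rng(1).standard_normal(1000)
h = np.random.default_rng(2).standard_normal(64)
print(np.allclose(fft_convolve(x, h), np.convolve(x, h)))   # → True
```

For long sequences this runs in $O(N\log N)$ instead of the $O(NM)$ of direct convolution.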
Quick Check
Q1: Is the system $y[n] = x[n] \cdot x[n-1]$ LTI? Why or why not?
Show answer
No. It is time-invariant (delayed input → delayed output), but nonlinear. Verification: let $x_1[n]=1, x_2[n]=1$; then $\mathcal{T}\{x_1+x_2\} = 2\cdot 2 = 4$, but $\mathcal{T}\{x_1\}+\mathcal{T}\{x_2\} = 1+1 = 2 \neq 4$, violating superposition.
Q2: If $h[n] = (0.5)^n u[n]$, is the system BIBO stable?
Show answer
$\sum_{n=0}^{\infty}|h[n]| = \sum_{n=0}^{\infty}(0.5)^n = \frac{1}{1-0.5} = 2 < \infty$. The impulse response is absolutely summable, so the system is BIBO stable.
Interactive: Convolution Sliding Animation
Select the input $x[n]$ and impulse response $h[n]$, then drag the slider to watch $h[n-k]$ slide across $x[k]$. The product area (green) is the value of $y[n]$ at the current $n$.
References: [1] Oppenheim & Schafer, Discrete-Time Signal Processing, 3rd ed., Ch.2. [2] Proakis & Manolakis, Digital Signal Processing, 4th ed., Ch.2. [3] Haykin & Van Veen, Signals and Systems, Ch.2.
2B.3 Difference Equations & System Function $H(z)$
From time-domain recursion to Z-domain algebra — a unified view of FIR and IIR
One-Sentence Summary: The Linear Constant-Coefficient Difference Equation (LCCDE) is the time-domain language for describing LTI systems; applying the M2B.4 Z-Transform converts it into the algebraic expression $H(z) = B(z)/A(z)$, where system stability and frequency response are entirely encoded in the poles and zeros.
Learning Objectives
- Write out the general LCCDE and understand the roles of $a_k$ (feedback coefficients) and $b_k$ (feedforward coefficients)
- Apply the Z-transform to the LCCDE to derive the system function $H(z) = B(z)/A(z)$
- Distinguish FIR (all $a_k=0$, zeros only) from IIR (has poles, requires stability analysis)
- Read stability from the pole-zero plot (causal system: all poles inside the unit circle)
- Compute the frequency response $H(e^{j\omega})$ from $H(z)$
Why Learn This: The difference equation is the "blueprint" for DSP hardware and software implementations — each $b_k$ is a multiply-add operation, and each $a_k$ is a feedback loop. The system function $H(z)$ lets you see at a glance whether a system is stable (poles inside the unit circle) or about to blow up (poles outside the circle). Filter design tools (MATLAB fdatool, Python scipy.signal) all operate on the $b_k, a_k$ coefficients.
Previously: The previous section established the convolution relation $y[n] = x[n] * h[n]$ for LTI systems. But convolution is an infinite summation — how does a real system implement it with finite memory? The answer: describe the behavior of $h[n]$ with a difference equation, using feedback to replace an infinitely long impulse response. The Z-transform then converts this recursive relation into an algebraically manipulable polynomial ratio.
Pain Point: Convolution Is Too Slow, Impulse Response Is Too Long
- A first-order IIR low-pass filter with $h[n] = a^n u[n]$ has an infinitely long impulse response; direct convolution requires infinite computation
- But the difference equation $y[n] = x[n] + a\,y[n-1]$ needs only 1 multiplication + 1 addition per sample via recursion
- The catch: feedback introduces a stability risk — if $|a| \geq 1$, the output will blow up
- We need a tool to quickly assess stability → pole-zero analysis of $H(z)$
Historical Context: The history of difference equations dates back to de Moivre (1718) solving linear recurrence relations. The Z-transform was introduced to control theory by Witold Hurewicz in 1947, and Ragazzini and Zadeh (1952) systematically applied it to sampled-data systems. After digital filter theory matured in the 1960s, pole-zero analysis of $H(z)$ became an everyday tool for DSP engineers. E. Christian and E. Eisenmann were among the first to convert analog circuit filters into digital difference equation implementations.
Core Concepts: The General LCCDE
General Form
$$y[n] + \sum_{k=1}^{N} a_k\,y[n-k] = \sum_{k=0}^{M} b_k\,x[n-k]$$$b_k$: Feedforward coefficients | $a_k$: Feedback coefficients | Order $= \max(N, M)$
Intuition: The left side contains $y[n-k]$ (past outputs) → this is "feedback" → creating a recursive structure. The right side contains only $x[n-k]$ (past inputs) → this is "feedforward" → a non-recursive structure.
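Note the sign convention: scipy's `lfilter(b, a, x)` implements exactly this LCCDE with `a = [1, a_1, ..., a_N]`. A minimal sketch with an illustrative first-order system (coefficients chosen here for demonstration, not from the text):

```python
import numpy as np
from scipy.signal import lfilter

# LCCDE: y[n] - 0.9 y[n-1] = 0.5 x[n] + 0.5 x[n-1]
b, a = [0.5, 0.5], [1.0, -0.9]     # b_k feedforward, a_k feedback
x = np.ones(8)                      # step input

y_ref = lfilter(b, a, x)

# the same recursion written out by hand (zero initial conditions)
y = np.zeros_like(x)
for n in range(len(x)):
    y[n] = 0.5 * x[n]
    if n >= 1:
        y[n] += 0.5 * x[n - 1] + 0.9 * y[n - 1]

print(np.allclose(y, y_ref))   # → True
```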
Deriving $H(z)$ via the Z-Transform
Using the time-shift property of the Z-transform: $\mathcal{Z}\{x[n-k]\} = z^{-k}X(z)$.
Expand: Derivation of $H(z)$
Apply the Z-transform to both sides of the LCCDE:
$$Y(z) + \sum_{k=1}^{N} a_k\,z^{-k}\,Y(z) = \sum_{k=0}^{M} b_k\,z^{-k}\,X(z)$$Factor out $Y(z)$ and $X(z)$:
$$Y(z)\left(1 + \sum_{k=1}^{N} a_k\,z^{-k}\right) = X(z)\left(\sum_{k=0}^{M} b_k\,z^{-k}\right)$$Define the system function:
$$H(z) = \frac{Y(z)}{X(z)} = \frac{\sum_{k=0}^{M} b_k\,z^{-k}}{1 + \sum_{k=1}^{N} a_k\,z^{-k}} = \frac{B(z)}{A(z)}$$$\;\blacksquare$
System Function (Transfer Function)
$$H(z) = \frac{B(z)}{A(z)} = \frac{b_0 + b_1 z^{-1} + b_2 z^{-2} + \cdots + b_M z^{-M}}{1 + a_1 z^{-1} + a_2 z^{-2} + \cdots + a_N z^{-N}}$$Factoring the numerator and denominator:
$$H(z) = b_0 \cdot \frac{\prod_{i=1}^{M}(1 - q_i z^{-1})}{\prod_{i=1}^{N}(1 - p_i z^{-1})}$$$q_i$: Zeros, $H(q_i)=0$ | $p_i$: Poles, $H(p_i) \to \infty$
FIR vs. IIR: A Comparison
| Characteristic | FIR (Finite Impulse Response) | IIR (Infinite Impulse Response) |
|---|---|---|
| Difference Equation | $y[n] = \sum_{k=0}^{M} b_k\,x[n-k]$ | $y[n] + \sum a_k\,y[n-k] = \sum b_k\,x[n-k]$ |
| $H(z)$ | Polynomial ($B(z)$ only) | Rational function $B(z)/A(z)$ |
| Poles | Only at $z=0$ (always stable) | At $z=p_i$; requires $|p_i|<1$ |
| Stability | Unconditionally stable | Depends on pole locations |
| Impulse Response Length | Finite ($M+1$ points) | Theoretically infinite |
| Computation | Requires more coefficients for steep roll-off | Fewer coefficients achieve narrow-band filtering |
Stability Criterion: Poles and the Unit Circle
A causal LTI system is BIBO stable $\iff$ all poles of $H(z)$ satisfy $|p_i| < 1$
That is, all poles lie strictly inside the unit circle in the Z-plane.
Intuition: The natural mode associated with pole $p_i$ is $p_i^n$. If $|p_i|<1$, then $p_i^n \to 0$ (decaying); if $|p_i|>1$, then $p_i^n \to \infty$ (blowing up); if $|p_i|=1$, oscillation persists without decay (marginally stable).
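Checking stability in practice reduces to rooting the denominator polynomial. A sketch with illustrative coefficients (a double pole at $z=0.8$, not an example from the text):

```python
import numpy as np

a = [1.0, -1.6, 0.64]    # A(z) = 1 - 1.6 z^{-1} + 0.64 z^{-2} = (1 - 0.8 z^{-1})^2
poles = np.roots(a)      # roots of z^2 - 1.6 z + 0.64 → double pole at z = 0.8
print(np.all(np.abs(poles) < 1))   # → True: all poles inside the unit circle
```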
Frequency Response: $H(z)$ Evaluated on the Unit Circle
$|H(e^{j\omega})|$: Magnitude Response | $\angle H(e^{j\omega})$: Phase Response
Geometric relationship between poles/zeros and frequency response: At frequency $\omega$, the magnitude $|H(e^{j\omega})|$ is proportional to "the product of distances from $e^{j\omega}$ to each zero / the product of distances from $e^{j\omega}$ to each pole." Near zeros the magnitude is small (notches); near poles the magnitude is large (peaks).
How to Use: Complete First-Order IIR Example
Problem: Analyze the system $y[n] = x[n] + 0.8\,y[n-1]$.
- Identify coefficients: $b_0 = 1$, $a_1 = -0.8$ (note the sign convention $y[n] + a_1 y[n-1]$ in the difference equation)
- System function:$$H(z) = \frac{1}{1 - 0.8\,z^{-1}} = \frac{z}{z - 0.8}$$
- Poles and zeros: Zero at $z=0$, pole at $z=0.8$ (inside the unit circle → stable)
- Impulse response:$$h[n] = (0.8)^n\,u[n]$$ (exponentially decaying, infinitely long → IIR)
- Frequency response:$$|H(e^{j\omega})| = \frac{1}{|1 - 0.8e^{-j\omega}|}$$ At $\omega=0$: $|H|=1/0.2=5$ (high gain at low frequencies); at $\omega=\pi$: $|H|=1/1.8\approx 0.56$ (attenuation at high frequencies) → low-pass filter
Applications
- Audio Equalizer: Constructed by cascading several second-order IIR filters (biquads), each with 2 poles + 2 zeros, corresponding to the difference equation coefficients $b_0, b_1, b_2, a_1, a_2$.
- PID Control Systems: A digital PID controller can be expressed as a difference equation. Z-transform analysis lets you determine closed-loop stability directly from the pole-zero plot.
- Communication Channel Equalizer: The receiver designs an IIR equalizer $H_{eq}(z) \approx 1/H_{ch}(z)$ to remove channel frequency distortion. One must ensure the zeros of $H_{ch}(z)$ are not outside the unit circle (otherwise the equalizer's poles move outside → unstable).
Pitfalls and Limitations
- Quantization Effects: In fixed-point IIR implementations, coefficient quantization can shift poles outside the unit circle, turning a previously stable filter unstable. Lower-order IIR filters are safer than higher-order ones.
- Poles on the unit circle $\neq$ stable: $|p_i|=1$ represents marginal instability (an oscillator). BIBO stability requires the magnitude to be strictly less than 1.
- Non-Minimum Phase Zeros: If zeros lie outside the unit circle, the causal inverse (equalizer) is unstable. Special treatment is needed (e.g., allpass decomposition).
Quick Check
Q1: Why is an FIR filter "unconditionally stable"? Explain from the $H(z)$ perspective.
Show answer
The denominator of an FIR $H(z)$ is 1 (no feedback $a_k$), so $H(z)=B(z)$ is a polynomial. Writing $B(z) = \sum_{k=0}^{M} b_k z^{-k} = z^{-M}\sum_{k} b_k z^{M-k}$, the only poles are an $M$-fold pole at $z=0$, and $|0|<1$ is always inside the unit circle. Therefore, regardless of the values of $b_k$, the system is always BIBO stable.
Q2: If $H(z) = \frac{1-z^{-1}}{1-0.95z^{-1}}$, where are the zeros and poles? Is the system low-pass or high-pass?
Show answer
Zero: $1-z^{-1}=0 \Rightarrow z=1$ (on the unit circle, at $\omega=0$). Pole: $1-0.95z^{-1}=0 \Rightarrow z=0.95$ (inside the unit circle → stable). At $\omega=0$: $|H(e^{j0})|=|1-1|/|1-0.95|=0$ (zero gain). At $\omega=\pi$: $|H(e^{j\pi})|=|1+1|/|1+0.95|=2/1.95\approx 1.03$. The gain is zero at DC → this is a high-pass filter.
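The pole-zero locations and the two gain evaluations can be checked in scipy (a minimal verification sketch):

```python
import numpy as np
from scipy.signal import tf2zpk, freqz

b, a = [1.0, -1.0], [1.0, -0.95]       # H(z) = (1 - z^{-1}) / (1 - 0.95 z^{-1})
z, p, k = tf2zpk(b, a)
print(z, p)                             # zero at z = 1.0, pole at z = 0.95

w, H = freqz(b, a, worN=np.array([0.0, np.pi]))
print(np.abs(H))                        # → [0.0, 1.026]: zero gain at DC → high-pass
```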
Interactive: Pole-Zero Plot, Frequency Response, and Impulse Response
Adjust the $b_k$ and $a_k$ coefficients and observe in real time the pole-zero locations on the Z-plane, the frequency response $|H(e^{j\omega})|$, and the impulse response $h[n]$.
References: [1] Oppenheim & Schafer, Discrete-Time Signal Processing, 3rd ed., Ch.3, 5, 6. [2] Mitra, Digital Signal Processing: A Computer-Based Approach, 4th ed., Ch.4. [3] Proakis & Manolakis, Digital Signal Processing, 4th ed., Ch.3.
4B.1 IIR Filter Design
From analog prototypes to digital implementation — a complete comparison of four classic IIR designs
Learning Objectives
- Understand the advantages and trade-offs of IIR filters relative to M4A FIR Design
- Compare the frequency-response characteristics of Butterworth / Chebyshev I / Chebyshev II / Elliptic designs
- Master the derivation of the Bilinear Transform and frequency pre-warping
- Complete the full design flow from specifications to $H(z)$
One-Sentence Summary
IIR filters use a recursive (feedback) structure to approximate ideal frequency responses with very low order; the four classic designs represent different trade-offs between passband flatness and transition-band steepness.
Why Learn This?
In the era of analog circuits, all filters were inherently IIR — RLC networks composed of capacitors, inductors, and resistors are recursive systems. Stephen Butterworth (1930) proposed the "maximally flat" criterion in an unassuming British radio engineering paper, laying the foundation for Butterworth filters.
Pafnuty Chebyshev (19th-century Russian mathematician) developed an equiripple approximation theory that was applied to filter design half a century later, giving rise to Chebyshev Type I and Type II filters. Wilhelm Cauer (1931) used elliptic function theory to derive the most efficient elliptic filters.
With the advent of the digital era, these analog prototypes were "ported" to the $z$-domain via the Bilinear Transform — allowing us to inherit decades of analog design wisdom while enjoying the precision and flexibility of digital implementation.
Previously...
In FIR design (Module 4A), we learned the advantages of FIR filters: linear phase, unconditional stability, and intuitive design. However, FIR has a fundamental limitation —
- To achieve a steep transition band, FIR requires very high order (hundreds or even thousands)
- High order = high computation = high latency (group delay $\approx (N-1)/2$ samples)
- In real-time systems (audio, control), such latency is unacceptable
IIR filters use feedback (recursion) to "remember" past outputs, achieving the same or even better frequency selectivity with far fewer coefficients. The price: nonlinear phase, potential instability, and more mathematical complexity in the design process.
Pain Point: FIR Is Too "Expensive"
Suppose you need a low-pass filter: passband up to 1 kHz, stopband starting at 1.2 kHz, stopband attenuation 60 dB, sampling rate 8 kHz.
- FIR approach: Using the Kaiser window method, the estimated order is $N \approx \frac{A_s - 7.95}{2.285 \cdot \Delta\omega} \approx \frac{60 - 7.95}{2.285 \times 0.05\pi} \approx 145$. Each output sample requires 145 multiply-accumulate operations.
- IIR approach: A 4th-order Elliptic filter meets the same specification. Each output sample requires only 8 multiply-accumulates — an 18× reduction in computation.
In embedded systems, real-time audio processing, and high-speed communications, this difference is decisive.
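The order gap can be reproduced with scipy's estimators — a sketch, assuming a 1 dB passband ripple for the elliptic design (the text does not specify the passband ripple):

```python
from scipy.signal import kaiserord, ellipord

fs = 8000
nyq = fs / 2

# FIR: Kaiser-window order estimate for 60 dB attenuation, 1000→1200 Hz transition
ntaps, beta = kaiserord(ripple=60, width=(1200 - 1000) / nyq)

# IIR: minimum elliptic order for the same edges (passband ripple assumed 1 dB)
n_iir, wn = ellipord(1000 / nyq, 1200 / nyq, gpass=1, gstop=60)

print(ntaps, n_iir)   # ~146 FIR taps vs. a single-digit IIR order
```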
Origin: From the $s$-Domain to the $z$-Domain
The design approach for IIR digital filters is not to design $H(z)$ directly, but rather:
- Use well-established analog filter design theory to obtain an analog prototype $H_a(s)$
- Apply an $s \to z$ mapping to "translate" the analog filter into a digital filter $H(z)$
Why not design $H(z)$ directly? Because analog filter design has a century of accumulated knowledge — formula tables, charts, closed-form solutions — borrowing from this body of work is far more economical than deriving everything from scratch.
Core Concepts: The Four Classic IIR Filters
Intuition: The four designs differ in how they allocate "approximation error" — you can make the passband as flat as possible (Butterworth), distribute the passband error uniformly to gain a steeper transition band (Chebyshev I), place the error in the stopband (Chebyshev II), or allow equiripple on both sides for the steepest possible transition band (Elliptic).
1. Butterworth (Maximally Flat)
$$|H_a(j\Omega)|^2 = \frac{1}{1 + (\Omega/\Omega_c)^{2N}}$$As flat as possible in the passband (the first $2N-1$ derivatives are zero at $\Omega=0$), but the transition-band roll-off is the slowest.
2. Chebyshev Type I (Passband Equiripple)
$$|H_a(j\Omega)|^2 = \frac{1}{1 + \varepsilon^2\,T_N^2(\Omega/\Omega_c)}$$$T_N$ is the $N$th-order Chebyshev polynomial; $\varepsilon$ controls the passband ripple magnitude. The passband has ripple, but the roll-off is steeper than Butterworth.
3. Chebyshev Type II (Stopband Equiripple)
$$|H_a(j\Omega)|^2 = \frac{1}{1 + \left[\varepsilon^2\,T_N^2(\Omega_s/\Omega)\right]^{-1}}$$The passband is flat; ripple appears in the stopband. Stopband zeros provide better stopband attenuation.
4. Elliptic / Cauer (Equiripple on Both Sides)
$$|H_a(j\Omega)|^2 = \frac{1}{1 + \varepsilon^2\,R_N^2(\Omega/\Omega_c)}$$$R_N$ is a rational Chebyshev function (ratio of Jacobi elliptic functions). Both the passband and stopband have equiripple, but the transition band is the steepest for a given order — this is the theoretically optimal solution.
Comparison of the Four Designs
| Characteristic | Butterworth | Chebyshev I | Chebyshev II | Elliptic |
|---|---|---|---|---|
| Passband | Maximally flat | Equiripple | Flat | Equiripple |
| Stopband | Monotonic decay | Monotonic decay | Equiripple | Equiripple |
| Transition Steepness | Gentlest (needs high order) | Moderate | Moderate | Steepest (lowest order) |
| Phase Linearity | Best | Worse | Worse | Worst |
| Design Parameters | $N, \Omega_c$ | $N, \Omega_c, \varepsilon$ | $N, \Omega_s, \varepsilon$ | $N, \Omega_c, \varepsilon_p, \varepsilon_s$ |
| Typical Use | Anti-aliasing, general | Frequency selection | Flat passband needed | Stringent specs |
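The "Transition Steepness" row can be quantified with scipy's order estimators — a sketch comparing the minimum order each family needs for one shared specification (the edge frequencies and ripple values here are illustrative, not from the text):

```python
from scipy.signal import buttord, cheb1ord, cheb2ord, ellipord

# shared spec: passband edge 0.25π, stopband edge 0.3π, 1 dB ripple, 60 dB atten.
spec = dict(wp=0.25, ws=0.3, gpass=1, gstop=60)

orders = {name: f(**spec)[0] for name, f in
          [("butter", buttord), ("cheby1", cheb1ord),
           ("cheby2", cheb2ord), ("ellip", ellipord)]}
print(orders)   # Butterworth needs the highest order, Elliptic the lowest
```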
Expand: Derivation of Butterworth pole locations
The poles of a Butterworth filter lie on a left-half semicircle in the $s$-plane. Starting from $|H_a(j\Omega)|^2 = 1/(1+(\Omega/\Omega_c)^{2N})$:
Let $s = j\Omega$, then $H_a(s)H_a(-s) = 1/(1+(-s^2/\Omega_c^2)^N)$.
Poles occur where $(-s^2/\Omega_c^2)^N = -1 = e^{j(2k+1)\pi}$, giving:
$$s_k = \Omega_c \, e^{j\pi(2k+N+1)/(2N)}, \quad k = 0, 1, \ldots, 2N-1$$There are $2N$ poles uniformly distributed on a circle of radius $\Omega_c$. Selecting the $N$ poles in the left half-plane gives the stable $H_a(s)$:
$$H_a(s) = \frac{\Omega_c^N}{\prod_{k=0}^{N-1}(s - s_k)}, \quad \text{Re}(s_k) < 0$$For example, when $N=2$, the four poles lie at $45°$, $135°$, $225°$, and $315°$; select the left-half-plane pair at $135°$ and $225°$:
$$s_{0,1} = \Omega_c\,e^{j3\pi/4},\; \Omega_c\,e^{j5\pi/4} = \Omega_c\left(-\frac{1}{\sqrt{2}} \pm j\frac{1}{\sqrt{2}}\right)$$ $$H_a(s) = \frac{\Omega_c^2}{s^2 + \sqrt{2}\,\Omega_c\,s + \Omega_c^2} \quad\blacksquare$$
Bilinear Transform
Intuition: We need to map the $s$-plane to the $z$-plane while ensuring that analog stability (left half-plane) maps to digital stability (inside the unit circle). The bilinear transform is exactly such a perfect mapping.
Bilinear Transform Formula
$$s = \frac{2}{T}\,\frac{z-1}{z+1} \quad \Longleftrightarrow \quad z = \frac{1 + (T/2)s}{1 - (T/2)s}$$Frequency Mapping (Frequency Warping):
Let $s = j\Omega$, $z = e^{j\omega}$, and substitute to get:
$$\Omega = \frac{2}{T}\tan\frac{\omega}{2} \quad\Longleftrightarrow\quad \omega = 2\arctan\frac{\Omega T}{2}$$The analog frequency $\Omega \in [0, \infty)$ is compressed into the digital frequency $\omega \in [0, \pi)$. At low frequencies the mapping is approximately linear ($\Omega \approx \omega/T$); at high frequencies severe warping occurs.
Frequency Pre-warping: During design, first convert the desired digital cutoff frequency $\omega_c$ back to the analog frequency $\Omega_c = (2/T)\tan(\omega_c/2)$. Use $\Omega_c$ to design the analog prototype so that, after the transform, the digital filter's cutoff falls precisely at $\omega_c$.
Expand: Derivation of the bilinear transform (trapezoidal integration)
The time-domain representation of an analog system $H_a(s) = Y(s)/X(s)$ is a differential equation. The simplest digitization approach is to approximate derivatives with numerical integration.
The Trapezoidal Rule approximates integration as:
$$y[n] = y[n-1] + \frac{T}{2}\big(x[n] + x[n-1]\big)$$Taking the $z$-transform: $Y(z) = z^{-1}Y(z) + \frac{T}{2}(1+z^{-1})X(z)$
$$\frac{Y(z)}{X(z)} = \frac{T/2 \cdot (1+z^{-1})}{1 - z^{-1}} = \frac{T}{2}\,\frac{z+1}{z-1}$$This is the digital approximation of $1/s$, hence $s$ corresponds to $\frac{2}{T}\frac{z-1}{z+1}$.
Proof that stability is preserved: Let $z = re^{j\theta}$, substitute into $s = \frac{2}{T}\frac{re^{j\theta}-1}{re^{j\theta}+1}$, and verify:
- $|z| < 1$ (inside unit circle) $\Rightarrow$ $\text{Re}(s) < 0$ (left half-plane)
- $|z| = 1$ (on unit circle) $\Rightarrow$ $\text{Re}(s) = 0$ (imaginary axis)
- $|z| > 1$ (outside unit circle) $\Rightarrow$ $\text{Re}(s) > 0$ (right half-plane)
Therefore, a stable analog system is guaranteed to remain stable after the bilinear transform. $\;\blacksquare$
How to Use: Complete Design Example
Goal: Design a 4th-order Butterworth low-pass filter with $f_c = 1\,\text{kHz}$ and $f_s = 8\,\text{kHz}$.
Step 1: Compute the digital cutoff frequency
$\omega_c = 2\pi f_c / f_s = 2\pi \times 1000 / 8000 = \pi/4 \;\text{rad}$
Step 2: Pre-warp to the analog frequency
Set $T = 1$ (for simplicity): $\Omega_c = \frac{2}{T}\tan\!\left(\frac{\omega_c}{2}\right) = 2\tan\!\left(\frac{\pi}{8}\right) \approx 2 \times 0.4142 = 0.8284$
Step 3: Design the analog Butterworth prototype
The 4th-order Butterworth poles lie on a circle of radius $|\Omega_c|$, at angles $\theta_k = \pi(2k+5)/8$, $k=0,1,2,3$:
The left-half-plane poles: $s_{0,1} = 0.8284\,e^{j5\pi/8},\; 0.8284\,e^{j7\pi/8}$, together with their complex conjugates $0.8284\,e^{j11\pi/8},\; 0.8284\,e^{j9\pi/8}$
Split into two second-order sections:
$H_a(s) = \frac{\Omega_c^2}{s^2 + 2\cos(\pi/8)\,\Omega_c\,s + \Omega_c^2} \cdot \frac{\Omega_c^2}{s^2 + 2\cos(3\pi/8)\,\Omega_c\,s + \Omega_c^2}$
Step 4: Apply the bilinear transform to each section
Substitute $s = 2(z-1)/(z+1)$ and simplify to get two digital biquad sections $H_1(z)$ and $H_2(z)$.
Step 5: Cascade to obtain the final filter
$H(z) = H_1(z) \cdot H_2(z)$, implemented in SOS (Second-Order Sections) form.
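scipy's `butter` performs Steps 2–5 internally (pre-warping, analog prototype, bilinear transform, SOS factoring). A sketch verifying the resulting −3 dB point (assumes scipy ≥ 1.2 for the `fs` keyword):

```python
import numpy as np
from scipy.signal import butter, sosfreqz

fs, fc = 8000, 1000
sos = butter(4, fc, btype='low', fs=fs, output='sos')   # pre-warp + bilinear inside

# magnitude at the cutoff should be 1/sqrt(2), i.e. exactly -3 dB
w, H = sosfreqz(sos, worN=np.array([2 * np.pi * fc / fs]))
print(np.abs(H))    # → [0.7071]
```

Because `butter` pre-warps, the −3 dB point lands exactly at $\omega_c = \pi/4$ despite the bilinear transform's frequency warping.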
Applications
| Application | Recommended Type | Rationale |
|---|---|---|
| Anti-aliasing Filter | Butterworth | Flat passband avoids signal distortion |
| Audio Equalizer | Butterworth / Chebyshev II | Passband flatness is critical |
| Communication Channel Selection | Elliptic | Steeper transition band is better; phase distortion can be compensated with an equalizer |
| Biomedical Signals (ECG Filtering) | Butterworth | No passband ripple allowed; better phase characteristics needed |
| Radar/Sonar Receiver | Chebyshev I / Elliptic | Strict frequency selectivity requirements |
Pitfalls and Limitations
1. Nonlinear Phase: The phase response of an IIR filter is not linear, meaning different frequency components experience different delays after passing through the filter. In applications requiring waveform fidelity (e.g., ECG, seismic wave analysis), this can cause waveform distortion. Solution: use zero-phase filtering (forward-backward filtering, i.e., MATLAB's filtfilt), though this works only offline.
2. Stability Risk: IIR poles must lie inside the unit circle. In high-order direct-form implementations, coefficient quantization can push poles outside the circle, causing instability. Solution: use cascade second-order sections (SOS) implementation.
3. High-Frequency Warping: The bilinear transform has severe frequency compression at high frequencies. Specifications near the Nyquist frequency (e.g., stopband edge above $0.45\pi$) will be inaccurate due to warping. Always use pre-warping.
Rule of Thumb: If your application does not require linear phase and the transition band is narrow, try IIR first (Butterworth is usually sufficient). If linear phase is needed, use FIR.
Quick Check
Q1: For the same filter order $N$, which design produces the steepest transition band? Why?
Answer
The Elliptic filter. Because it allows equiripple in both the passband and the stopband, distributing the approximation error "evenly" across both bands. According to Chebyshev approximation theory, this is the optimal strategy for minimizing the maximum deviation at a given order. Butterworth concentrates all approximation precision near $\Omega=0$ (maximally flat), so the approximation deteriorates away from $\Omega_c$ and requires a higher order to achieve the same stopband attenuation.
Q2: What happens if you forget to apply frequency pre-warping in the bilinear transform?
Answer
Because the bilinear transform's frequency mapping $\omega = 2\arctan(\Omega T/2)$ is nonlinear, omitting pre-warping causes the actual cutoff frequency of the digital filter to be lower than intended (since $2\arctan(x) < 2x$ for $x > 0$, every analog frequency is compressed downward). The higher the frequency, the greater the deviation; near Nyquist, the distortion is extreme. Pre-warping reverses this by first converting the desired digital cutoff frequency to the corresponding analog frequency $\Omega_c = (2/T)\tan(\omega_c/2)$, so the cutoff falls precisely at the correct position after the transform.
Interactive: Comparing the Four IIR Filter Designs
Select a filter type and parameters to compare the magnitude responses of the four classic designs in real time.
Observe: Butterworth is smooth but rolls off slowly; Elliptic is the steepest but has ripple. Adjust the order to see convergence speed differences.
All-Pass Filters and Minimum-Phase Decomposition
Any stable causal system can be decomposed as "minimum-phase $\times$ all-pass" — this is a key theoretical result for IIR design and phase equalization.
All-Pass Filter
Definition: $|H_{ap}(e^{j\omega})| = 1$ for all $\omega$ (the magnitude is the same at every frequency)
Typical form:
$$H_{ap}(z) = \frac{z^{-1} - a^*}{1 - a\,z^{-1}}, \qquad |a| < 1$$Each pole $a$ is paired with a mirror zero $1/a^*$ (outside the unit circle). The combination of pole plus mirror zero makes the magnitude response identically equal to 1, while the phase response is non-zero — this is why all-pass filters adjust phase without changing magnitude.
Minimum-Phase System
Definition: all zeros and poles lie inside the unit circle ($|z|<1$)
Key properties:
- Given a magnitude response $|H(e^{j\omega})|$, the minimum-phase realization has the smallest group delay among all causal realizations
- It is causally invertible ($1/H_{min}(z)$ is also stable and causal)
- Energy is concentrated near the beginning of the signal (peak arrives earliest)
Minimum-Phase + All-Pass Decomposition Theorem
Any stable causal system $H(z)$ can be decomposed as:
$$H(z) = H_{min}(z) \cdot H_{ap}(z)$$where $H_{min}$ is minimum-phase and $H_{ap}$ is all-pass.
Derivation: Why does this decomposition work?
For every zero $z_0$ of $H(z)$ that lies outside the unit circle ($|z_0|>1$), "reflect" it to $1/z_0^*$ inside the unit circle, and add an all-pass factor to compensate for the difference:
$$\frac{1 - z_0 z^{-1}}{1} = \underbrace{\frac{1 - z_0^{-1*}z^{-1}}{1}}_{\text{minimum phase}} \cdot \underbrace{\frac{z^{-1} - z_0^{*}}{1 - z_0^{-1*}z^{-1}}}_{\text{all-pass}}$$Expanding verifies: the product of the two factors on the right equals the original left-hand side. Thus the original "outside zero" becomes "an inside zero plus an all-pass rotation." $\blacksquare$
Applications:
- Filter inversion: to compute $H^{-1}(z)$, one must first separate the non-minimum-phase part
- Phase equalization: cascading an all-pass filter after the original filter changes the phase response without affecting the magnitude
- Spectral shaping: all systems with the same magnitude but different phases share the same $|H|$; they differ only in the all-pass component
- System identification: given $|H|$, the minimum-phase realization is the "simplest" causal implementation
References
- [1] A. V. Oppenheim & R. W. Schafer, Discrete-Time Signal Processing, 3rd ed., Ch. 7.
- [2] S. K. Mitra, Digital Signal Processing: A Computer-Based Approach, Ch. 8.
- [3] L. B. Jackson, Digital Filters and Signal Processing, Ch. 6.
- [4] S. Butterworth, "On the Theory of Filter Amplifiers," Wireless Engineer, 1930.
4C.1 Filter Realization Structures
Same transfer function, different structures = different numerical fates
Learning Objectives
- Understand that a single $H(z)$ can be implemented with multiple equivalent structures
- Compare Direct Form I / II, Cascade (SOS), Parallel, and Lattice structures
- Understand why Cascade/SOS is the industry standard under finite-precision arithmetic
- Gain intuition through float32 vs. float64 experiments on how structure affects numerical stability
One-Sentence Summary
The same $H(z)$ implemented in different structures is mathematically identical under infinite precision — but under finite precision (fixed-point/floating-point), the choice of structure determines whether the filter works accurately or breaks down entirely.
Why Learn This?
In the late 1960s, digital filters began to be used in military radar and space missions. Engineers implemented high-order IIR filters using the theoretically correct Direct Form, only to find that the filters completely failed in hardware — outputs exploded, self-oscillation occurred, and frequency responses were unrecognizable.
Research by James Kaiser and Clifford Weinstein (MIT Lincoln Lab, ~1969) revealed the cause: finite word-length effects. When an IIR filter's poles are very close to the unit circle (common in narrowband filters), Direct Form coefficients are extremely sensitive to quantization — tiny rounding errors can push poles outside the circle.
The solution was to decompose high-order filters into a cascade of Second-Order Sections (SOS) — each section has only two poles, dramatically reducing coefficient sensitivity. This lesson remains a core DSP engineering principle to this day.
Previously...
In Module 4B, we learned how to design the transfer function $H(z) = B(z)/A(z)$ of an IIR filter. But a transfer function only describes the input-output relationship — it does not specify how the internals are wired. The same $H(z)$ can be realized with entirely different arrangements of delays, adders, and multipliers; each arrangement is a "structure."
Pain Point: "Theoretically Correct" Does Not Mean "Practically Usable"
Consider an 8th-order narrowband bandpass IIR filter (center frequency $\omega_0 = 0.1\pi$, bandwidth $0.01\pi$):
- Implemented in Direct Form II: the denominator polynomial $A(z) = 1 + a_1 z^{-1} + \cdots + a_8 z^{-8}$, where some $a_k$ have absolute values in the hundreds
- Under 32-bit floating point, rounding of $a_k$ shifts pole positions — a pole's radius changes from 0.998 to 1.003
- Result: the system is unstable. Output grows exponentially until overflow
- Switch to Cascade/SOS (4 second-order sections in series): each section's coefficient magnitudes are $\leq 2$, stable even with 16-bit fixed point
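The pole-drift effect above can be reproduced in a few lines — a minimal sketch using SciPy, where an 8th-order Butterworth bandpass with band edges near $0.1\pi$ stands in for the narrowband design in the text:

```python
import numpy as np
from scipy import signal

# 8th-order narrowband bandpass (band edges near 0.1*pi), designed in float64.
# butter() with btype='bandpass' doubles the order: 4 -> 8 poles.
b, a = signal.butter(4, [0.095, 0.105], btype='bandpass', output='ba')
sos = signal.butter(4, [0.095, 0.105], btype='bandpass', output='sos')

# Round the direct-form denominator to float32, then look at pole radii.
a32 = a.astype(np.float32).astype(np.float64)
r_exact = np.max(np.abs(np.roots(a)))       # max pole radius, exact coeffs
r_quant = np.max(np.abs(np.roots(a32)))     # max pole radius, rounded coeffs

# Round each SOS section to float32: every section holds only 2 poles.
sos32 = sos.astype(np.float32).astype(np.float64)
r_sos = max(np.max(np.abs(np.roots(sec[3:]))) for sec in sos32)

print(f"direct form, float64 coeffs: max|pole| = {r_exact:.6f}")
print(f"direct form, float32 coeffs: max|pole| = {r_quant:.6f}")
print(f"SOS,         float32 coeffs: max|pole| = {r_sos:.6f}")
```

Typically the float32 direct-form pole radii drift far more than the SOS ones, and for sharper or higher-order designs they can cross the unit circle entirely, while the float32 SOS poles stay essentially where they were designed.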
Origin: Why Are There So Many Structures?
Mathematically, an $N$th-order IIR transfer function
$$H(z) = \frac{B(z)}{A(z)} = \frac{\sum_{k=0}^{M} b_k z^{-k}}{1 + \sum_{k=1}^{N} a_k z^{-k}}$$can be computed using any equivalent set of difference equations. Different organizations correspond to different Signal Flow Graphs (SFGs), and each is a "structure." They are fully equivalent under infinite precision, but their behavior diverges drastically under finite precision — this is why studying structures matters.
Core Concepts: Five Major Structures
1. Direct Form I
The most intuitive: compute the numerator (FIR part) first, then the denominator (feedback).
Requires $M+N$ delay elements and $M+N+1$ multiplications. Two independent delay lines.
2. Direct Form II (Canonical Form)
Swap the order of numerator and denominator (allowed for LTI systems) and merge the delay lines.
Requires only $\max(M,N)$ delay elements (minimum delays = "canonical"), but the dynamic range of internal node $w[n]$ can be very large.
3. Cascade / SOS (Cascaded Second-Order Sections)
Decompose $H(z)$ into a product of $L = \lceil N/2 \rceil$ biquad sections. Each section handles only its own pair of conjugate poles and pair of zeros; the coefficient range is small and insensitive to quantization.
Industry Standard: Virtually all DSP chip filter libraries use SOS form. MATLAB's sosfilt and Python scipy's sosfilt are both SOS implementations.
4. Parallel Form
Second-order sections in parallel, obtained via partial fraction expansion. Each section operates independently and can be parallelized.
5. Lattice Structure
Parameterized by reflection coefficients $\kappa_i$. For all-pole filters, $|\kappa_i| < 1$ guarantees stability — this is a structural guarantee independent of precision.
Commonly used in speech coding (LPC) and adaptive filters.
Structure Comparison Table
| Structure | Delays | Multiplications | Quantization Sensitivity | Overflow Risk | Notes |
|---|---|---|---|---|---|
| Direct Form I | $M+N$ | $M+N+1$ | High | Medium | Most intuitive |
| Direct Form II | $\max(M,N)$ | $M+N+1$ | High | High (internal nodes) | Fewest delays |
| Cascade/SOS | $2L$ | $5L+1$ | Low | Low | Industry preferred |
| Parallel | $2L$ | $5L+1$ | Low | Low | Parallelizable |
| Lattice | $N$ | $2N$ | Lowest | Lowest | Stability guaranteed (all-pole) |
Expand: Why does SOS have low coefficient sensitivity?
Consider the denominator polynomial of an $N$th-order filter $A(z) = \prod_{k=1}^{N}(1-p_k z^{-1})$. In direct form, the coefficients $a_k$ are elementary symmetric functions of the poles.
Pole sensitivity with respect to coefficients:
$$\frac{\partial p_i}{\partial a_k} = \frac{-p_i^{N-k}}{\prod_{j \neq i}(p_i - p_j)}$$When poles cluster together (narrowband filters), the denominator $\prod_{j \neq i}(p_i - p_j)$ approaches zero, and the sensitivity approaches infinity.
In SOS, each section has only 2 poles $p_i, p_i^*$, and the sensitivity is:
$$\frac{\partial p_i}{\partial a_{1i}} = \frac{-p_i}{p_i - p_i^*} = \frac{-p_i}{2j\,\text{Im}(p_i)}$$This value is bounded (as long as the poles are not on the real axis) and does not worsen as the filter order increases. $\;\blacksquare$
How to Use: Structure Selection Guide
- FIR filters: Direct Form (or Transposed) is usually sufficient, since there is no feedback and no stability issues.
- Low-order IIR ($\leq 2$nd order): Direct Form II is fine. A single biquad is one SOS section.
- High-order IIR ($\geq 4$th order): Always use Cascade/SOS. Do not use Direct Form.
- Parallel computation needed (FPGA/GPU): Consider Parallel Form.
- Speech/adaptive (all-pole): Lattice structure, leveraging the stability guarantee of reflection coefficients.
Golden Rule: Never implement a high-order IIR filter using Direct Form. Even MATLAB warns in its filter documentation: "For high-order filters, use sosfilt."
Applications
- Embedded DSP (16/32-bit fixed-point): SOS is the only reliable way to implement IIR. Filter libraries for TI C5000/C6000 series are all SOS-based.
- Audio Processing (Parametric EQ): One biquad per frequency band, cascaded together. The EQs in DAWs like Pro Tools and Logic Pro are biquad cascades.
- Control Systems: A PID controller can be implemented as a single biquad. Higher-order controllers use SOS.
- Speech Coding (LPC): The 10th-order all-pole model uses a Lattice structure; reflection coefficients can be directly compressed and transmitted.
Pitfalls and Limitations
1. SOS section ordering: The cascade order affects the dynamic range of intermediate nodes. General rule: place the sections with poles closest to the unit circle last (so the highest-gain section is processed last), or use the zp2sos function to automatically pair poles and zeros.
2. Gain distribution: The total gain $K$ should be distributed across sections, not concentrated in one — otherwise that section may overflow.
3. Transposed Form pitfall: Direct Form II Transposed is good for FIR (low accumulated error), but the internal dynamic range problem still exists for IIR.
Quick Check
Q1: How many delay elements does a 6th-order IIR filter need in Direct Form II? In Cascade/SOS?
Answer
Direct Form II: $\max(M,N) = 6$ delay elements (canonical form). Cascade/SOS: Split into $L = 3$ second-order sections, each with 2 delays, totaling $2 \times 3 = 6$. The number of delays is the same, but SOS has far lower coefficient sensitivity than Direct Form.
Q2: Why do fixed-point DSP chips almost exclusively use SOS for IIR?
Answer
Fixed-point arithmetic has very low precision (typically 16-bit, i.e., only 15 bits for the fractional part), so coefficient quantization errors are relatively large. In Direct Form, the coefficients of a high-order polynomial are symmetric functions of all the poles; tiny quantization can drastically alter pole positions and even cause instability. In SOS, each section manages only two poles, the coefficient range is small ($|a_{1i}| \leq 2$, $|a_{2i}| \leq 1$), and the impact of quantization error on poles is confined to a controllable range.
Interactive: Direct Form vs. Cascade/SOS Precision Comparison
The same 4th-order IIR filter implemented in both Direct Form II and Cascade/SOS. Compare the frequency response differences under float32 vs. float64.
Red dashed = float32 Direct Form (note high-frequency deviation), blue solid = float64 reference, green = float32 SOS (nearly coincides with reference).
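The interactive comparison can be approximated offline. The sketch below (assuming SciPy, with a 4th-order elliptic low-pass standing in for the demo filter) filters white noise in float32 through both structures and measures the deviation from a float64 reference:

```python
import numpy as np
from scipy import signal

# 4th-order elliptic lowpass: 0.5 dB ripple, 60 dB stopband, cutoff 0.1*Nyquist
b, a = signal.ellip(4, 0.5, 60, 0.1, output='ba')
sos = signal.ellip(4, 0.5, 60, 0.1, output='sos')

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)

y_ref = signal.lfilter(b, a, x)                      # float64 reference
y_df32 = signal.lfilter(b.astype(np.float32),        # float32 direct form
                        a.astype(np.float32),        # (scipy's lfilter is DF-II
                        x.astype(np.float32))        #  transposed)
y_sos32 = signal.sosfilt(sos.astype(np.float32),     # float32 biquad cascade
                         x.astype(np.float32))

err_df = np.max(np.abs(y_df32 - y_ref))
err_sos = np.max(np.abs(y_sos32 - y_ref))
print(f"float32 direct form error: {err_df:.2e}")
print(f"float32 SOS error        : {err_sos:.2e}")
```

The SOS error is normally orders of magnitude smaller; pushing the order higher or the band narrower widens the gap further.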
References
- [1] A. V. Oppenheim & R. W. Schafer, Discrete-Time Signal Processing, 3rd ed., Ch. 6.
- [2] P. S. R. Diniz, E. A. B. da Silva, S. L. Netto, Digital Signal Processing: System Analysis and Design, Ch. 9.
- [3] L. B. Jackson, "Roundoff-Noise Analysis for Fixed-Point Digital Filters Realized in Cascade or Parallel Form," IEEE Trans. Audio Electroacoustics, 1970.
4E.1 Adaptive Filters
When the optimal filter changes over time — let the filter learn by itself
Learning Objectives
- Understand the motivation for adaptive filters and the M8B Wiener Filter optimal solution
- Derive the LMS (Least Mean Squares) algorithm and its convergence conditions
- Compare the performance and computational complexity of LMS / NLMS / RLS
- Build intuition through an interactive noise cancellation experiment
One-Sentence Summary
An adaptive filter is a "learn while doing" system — it does not need to know the signal's statistical properties in advance; instead, it continuously adjusts its coefficients based on the error signal during operation, converging toward the optimal solution.
Why Learn This?
Bernard Widrow and his doctoral student Marcian E. (Ted) Hoff (yes, the same Hoff who later co-invented the Intel 4004 microprocessor) proposed the LMS algorithm at Stanford University in 1960. It is a striking coincidence — an algorithm that transformed adaptive signal processing, co-invented by someone who would go on to transform the computer industry.
The breakthrough of LMS lies in its extreme simplicity: only three lines of computation (compute error, update weights, shift window), yet it can work in unknown and time-varying environments. Its simplicity enabled implementation on the most primitive hardware — Widrow built ADALINE (Adaptive Linear Element) using analog circuits in the 1960s.
Today, adaptive filters are everywhere: your phone cancels echo during calls (AEC), noise-canceling headphones suppress ambient noise (ANC), WiFi modems equalize channel distortion — all powered by LMS or its variants.
Previously...
So far, all the filters we have designed (FIR, IIR) are fixed — once the coefficients are designed, they never change. This is sufficient for static scenarios (fixed low-pass/high-pass requirements), but many real-world scenarios are dynamic:
- Acoustic Echo Cancellation (AEC): The speaker moves around the room, and the echo path continuously changes
- Channel Equalization: Wireless channel fading varies randomly over time
- Active Noise Control (ANC): The noise source's statistical properties change (engine RPM changes, wind speed varies)
We need a filter that can automatically track environmental changes.
Pain Point: You Don't Know What "Optimal" Is
In Wiener filter theory, the optimal FIR filter solution is:
$$\mathbf{w}_{\text{opt}} = \mathbf{R}_{xx}^{-1}\,\mathbf{r}_{xd}$$where $\mathbf{R}_{xx}$ is the input signal's autocorrelation matrix, and $\mathbf{r}_{xd}$ is the cross-correlation vector between the input and the desired output.
The problem is:
- You usually do not know $\mathbf{R}_{xx}$ and $\mathbf{r}_{xd}$ (they are statistical quantities requiring large amounts of data to estimate)
- Even if estimated, solving $\mathbf{R}_{xx}^{-1}$ is an $O(M^3)$ operation ($M$ = filter length)
- The environment is changing — last second's statistics are already outdated
The adaptive filter strategy is: do not solve the equation; instead, approach the solution one step at a time — with each new sample, take a small step toward "better."
Origin: From Wiener to LMS
Evolution of ideas:
- Wiener (1949): Optimal solution $\mathbf{w}_{\text{opt}} = \mathbf{R}_{xx}^{-1}\mathbf{r}_{xd}$ — perfect but impractical (requires knowing the statistics)
- Steepest Descent: $\mathbf{w}[n+1] = \mathbf{w}[n] - \mu\,\nabla J[n]$, where $J = E[|e[n]|^2]$ is the mean squared error (MSE). Iteratively approaches $\mathbf{w}_{\text{opt}}$, but still requires computing the expected value of the gradient.
- LMS (Widrow & Hoff, 1960): Replace the true gradient with an instantaneous gradient estimate: $\hat{\nabla}J[n] = -2\,e[n]\,\mathbf{x}[n]$. This estimate is noisy, but its expected value points in the correct direction — the ancestor of Stochastic Gradient Descent (SGD).
Core Concepts: The LMS Algorithm
Intuition: Imagine you are blindfolded, standing in a bowl-shaped valley (the MSE surface), trying to reach the bottom (the optimal solution). You cannot see the global terrain, but you can feel the slope under your feet (the instantaneous gradient). LMS takes a small step downhill at each iteration — although the direction is not perfectly accurate each time (since it is an estimate), on average you converge toward the bottom.
LMS Algorithm (Three Core Lines)
1. Compute output and error:
$$\hat{d}[n] = \mathbf{w}^T[n]\,\mathbf{x}[n] = \sum_{k=0}^{M-1}w_k[n]\,x[n-k]$$
$$e[n] = d[n] - \hat{d}[n]$$
2. Update weights:
$$\mathbf{w}[n+1] = \mathbf{w}[n] + \mu\,e[n]\,\mathbf{x}[n]$$
$d[n]$: desired signal, $\mathbf{x}[n] = [x[n], x[n-1], \ldots, x[n-M+1]]^T$: input vector, $\mu$: step size
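The three core lines translate directly into code. A minimal sketch in plain NumPy (the 4-tap test system and the step size are illustrative choices, not from the text):

```python
import numpy as np

def lms_filter(x, d, M, mu):
    """Plain LMS: returns (y, e, w) for input x and desired signal d."""
    w = np.zeros(M)
    y = np.zeros(len(x))
    e = np.zeros(len(x))
    for n in range(M - 1, len(x)):
        xv = x[n - M + 1 : n + 1][::-1]   # x[n], x[n-1], ..., x[n-M+1]
        y[n] = w @ xv                     # output d_hat[n] = w^T x
        e[n] = d[n] - y[n]                # error
        w = w + mu * e[n] * xv            # weight update
    return y, e, w

# Sanity check via system identification: learn a known 4-tap FIR system.
rng = np.random.default_rng(1)
x = rng.standard_normal(20000)
h_true = np.array([0.5, -0.3, 0.2, 0.1])
d = np.convolve(x, h_true)[: len(x)]      # noiseless desired output

_, e, w = lms_filter(x, d, M=4, mu=0.02)  # mu well under 2/(M*sigma_x^2) = 0.5
print(np.round(w, 3))                     # converges to h_true
```

With noiseless data the weights converge to the true system; adding observation noise would leave the steady-state misadjustment discussed below.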
Choosing the Step Size $\mu$
$\mu$ is the most critical hyperparameter of LMS:
- Too small: Convergence is extremely slow, cannot track environmental changes
- Too large: Divergence (weights explode)
- Stability condition:
$$0 < \mu < \frac{2}{\lambda_{\max}}\,, \qquad \text{conservatively}\quad 0 < \mu < \frac{2}{M\,\sigma_x^2}$$
where $M$ = filter length, $\sigma_x^2$ = input power, $\lambda_{\max}$ = largest eigenvalue of $\mathbf{R}_{xx}$ (note $\lambda_{\max} \leq \text{tr}(\mathbf{R}_{xx}) = M\sigma_x^2$).
Practical Rule: Set $\mu \approx \frac{1}{10 \cdot M \cdot \hat{\sigma}_x^2}$ (one-twentieth of the conservative stability bound $2/(M\hat{\sigma}_x^2)$), balancing stability and convergence speed.
NLMS: Normalized LMS
The LMS step size is sensitive to input power. NLMS normalizes the update by the input vector's energy at each step:
$$\mathbf{w}[n+1] = \mathbf{w}[n] + \frac{\tilde{\mu}}{\delta + \|\mathbf{x}[n]\|^2}\, e[n]\,\mathbf{x}[n]$$$\delta > 0$ is a small constant to prevent division by zero. $\tilde{\mu} \in (0, 2)$ no longer depends on input power.
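A sketch of the normalized update in plain NumPy (the 3-tap test system and the large input scale are illustrative): the same $\tilde{\mu}$ works without retuning even when the input power is huge, whereas plain LMS would need $\mu < 2/(M\sigma_x^2) \approx 7 \times 10^{-5}$ here.

```python
import numpy as np

def nlms_step(w, xv, d_n, mu_t=0.5, delta=1e-6):
    """One NLMS update: effective step = mu_t / (delta + ||x||^2)."""
    e = d_n - w @ xv
    return w + (mu_t / (delta + xv @ xv)) * e * xv, e

# Identify a 3-tap system driven by very high-power input.
rng = np.random.default_rng(2)
x = 100.0 * rng.standard_normal(5000)         # sigma_x^2 = 1e4
h_true = np.array([0.4, 0.25, -0.1])
d = np.convolve(x, h_true)[: len(x)]

w = np.zeros(3)
for n in range(2, len(x)):
    xv = x[n - 2 : n + 1][::-1]               # x[n], x[n-1], x[n-2]
    w, _ = nlms_step(w, xv, d[n])
print(np.round(w, 3))                         # converges to h_true
```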
RLS: Recursive Least Squares
Instead of gradient descent, RLS recursively solves the exponentially weighted least-squares problem:
$$\mathbf{w}[n] = \arg\min_{\mathbf{w}} \sum_{i=0}^{n} \lambda^{n-i}\,\big|d[i] - \mathbf{w}^T\mathbf{x}[i]\big|^2$$
$\lambda \in (0.95, 1)$ is the forgetting factor. RLS converges much faster (unaffected by eigenvalue spread), but requires $O(M^2)$ operations per step (recursive update of the inverse correlation matrix).
Comparison of the Three Algorithms
| Characteristic | LMS | NLMS | RLS |
|---|---|---|---|
| Computation per step | $O(M)$ | $O(M)$ | $O(M^2)$ |
| Convergence speed | Slow (depends on $\lambda_{\max}/\lambda_{\min}$) | Moderate | Fast |
| Tracking ability | Moderate | Moderate | Good |
| Stability | Requires careful $\mu$ selection | More robust | May be numerically unstable |
| Typical applications | AEC, channel equalization | General-purpose (preferred) | Fast convergence needs |
Expand: LMS convergence analysis
Define the weight error vector $\tilde{\mathbf{w}}[n] = \mathbf{w}[n] - \mathbf{w}_{\text{opt}}$ and substitute into the LMS update equation:
$$\tilde{\mathbf{w}}[n+1] = (\mathbf{I} - \mu\,\mathbf{x}[n]\mathbf{x}^T[n])\,\tilde{\mathbf{w}}[n] + \mu\,e_o[n]\,\mathbf{x}[n]$$where $e_o[n] = d[n] - \mathbf{w}_{\text{opt}}^T\mathbf{x}[n]$ is the optimal error (irreducible noise).
Taking expectations (using the independence assumption that $\mathbf{x}[n]$ and $\tilde{\mathbf{w}}[n]$ are independent):
$$E[\tilde{\mathbf{w}}[n+1]] = (\mathbf{I} - \mu\,\mathbf{R}_{xx})\,E[\tilde{\mathbf{w}}[n]]$$Let $\mathbf{R}_{xx} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^T$ (eigendecomposition), and define the rotated error $\mathbf{v}[n] = \mathbf{Q}^T\tilde{\mathbf{w}}[n]$:
$$E[v_i[n+1]] = (1 - \mu\lambda_i)\,E[v_i[n]]$$Convergence condition: $|1 - \mu\lambda_i| < 1$ for all $i$, i.e., $0 < \mu < 2/\lambda_{\max}$.
The convergence speed is determined by the slowest mode: $\tau_i = -1/\ln(1-\mu\lambda_i) \approx 1/(\mu\lambda_i)$. The slowest mode corresponds to $\lambda_{\min}$, so the larger the eigenvalue spread $\chi = \lambda_{\max}/\lambda_{\min}$, the slower the convergence.
Steady-state excess MSE:
$$J_{\text{excess}} \approx \frac{\mu\,M\,\sigma_x^2}{2}\,\sigma_{e_o}^2 = \text{misadjustment} \times J_{\min}$$with misadjustment $\mathcal{M} \approx \mu\,M\,\sigma_x^2 / 2 = \mu\,\text{tr}(\mathbf{R}_{xx})/2$ and $J_{\min} = \sigma_{e_o}^2$. $\;\blacksquare$
How to Use: Noise Cancellation System Design
Scenario: The microphone captures a signal $d[n] = s[n] + v[n]$ (speech + noise), and a reference microphone captures a correlated copy of the noise $x[n]$.
Step 1: Initialization
$\mathbf{w}[0] = \mathbf{0}$, choose $M = 32$ (based on the impulse response length of the noise path), $\mu = 0.01$
Step 2: For each time step $n$
(a) Assemble the input vector $\mathbf{x}[n] = [x[n], x[n-1], \ldots, x[n-31]]^T$
(b) Compute the noise estimate $\hat{v}[n] = \mathbf{w}^T[n]\,\mathbf{x}[n]$
(c) Compute the error (= cleaned output) $e[n] = d[n] - \hat{v}[n] \approx s[n]$
(d) Update $\mathbf{w}[n+1] = \mathbf{w}[n] + \mu\,e[n]\,\mathbf{x}[n]$
Step 3: Monitor convergence
Observe the sliding average of $|e[n]|^2$ (learning curve) and verify that the MSE decreases to steady state. If it does not decrease or oscillates, reduce $\mu$.
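Steps 1–3 assembled into a runnable sketch (NumPy only; the sinusoidal "speech" stand-in, the 8-tap noise path, and all constants are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, mu = 30000, 32, 0.005

s = np.sin(2 * np.pi * 0.03 * np.arange(N))            # stand-in for speech
x = rng.standard_normal(N)                              # reference noise mic
h_path = np.array([0.8, -0.5, 0.3, 0.2, -0.1, 0.1, 0.05, -0.05])
v = np.convolve(x, h_path)[:N]                          # noise reaching main mic
d = s + v                                               # main mic: speech + noise

w = np.zeros(M)                                         # Step 1: w[0] = 0
e = np.zeros(N)
for n in range(M - 1, N):                               # Step 2
    xv = x[n - M + 1 : n + 1][::-1]                     # (a) input vector
    v_hat = w @ xv                                      # (b) noise estimate
    e[n] = d[n] - v_hat                                 # (c) cleaned output
    w = w + mu * e[n] * xv                              # (d) LMS update

# Step 3: residual noise power before/after (s is known here, so measurable)
mse_before = np.mean((d[-5000:] - s[-5000:]) ** 2)
mse_after = np.mean((e[-5000:] - s[-5000:]) ** 2)
print(f"noise power at mic: {mse_before:.3f}")
print(f"residual after LMS: {mse_after:.3f}")
```

The residual that remains after convergence is the excess MSE from the Expand box: here the "disturbance" seen by the adaptive filter is the speech itself, so $J_{\min} = E[s^2]$.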
Applications
| Application | $d[n]$ | $x[n]$ | Algorithm |
|---|---|---|---|
| Acoustic Echo Cancellation (AEC) | Near-end mic (speech + echo) | Far-end speech | NLMS / PBFDAF |
| Active Noise Control (ANC) | Error microphone | Reference microphone | FxLMS |
| Channel Equalization | Received signal | Training sequence | LMS / RLS |
| System Identification | System output | System input | LMS / NLMS |
| Beamforming | Array microphone signals | Desired direction reference | LMS / MVDR |
Pitfalls and Limitations
1. Reference signal must be correlated with noise: If $x[n]$ is uncorrelated with the noise component in $d[n]$, LMS cannot learn any useful mapping. In ANC, the reference microphone must be physically close to the noise source.
2. Tracking lag for non-stationary signals: LMS needs time to track environmental changes (approximately $10/(\mu\lambda_{\min})$ samples). If the environment changes too fast, the filter may never catch up.
3. Eigenvalue spread problem: When the condition number $\chi = \lambda_{\max}/\lambda_{\min}$ of $\mathbf{R}_{xx}$ is large (e.g., for speech signals, $\chi$ can exceed 100), the convergence speeds of different LMS modes vary dramatically. NLMS partially addresses this; RLS fully solves it but at higher computational cost.
Practical Advice: Start with NLMS ($\tilde{\mu} = 0.5$, $\delta = 10^{-6}$). If convergence is too slow, switch to frequency-domain adaptive filtering (FBLMS) or RLS.
Quick Check
Q1: What happens if the LMS step size $\mu$ is increased 10x?
Answer
Convergence speed increases roughly 10x (within the stable range), but the steady-state excess MSE also increases 10x ($\mathcal{M} \propto \mu$). If $\mu$ exceeds the stability bound $2/(M\sigma_x^2)$, the algorithm diverges — weights grow exponentially and the output explodes. Therefore, the choice of $\mu$ is a trade-off between convergence speed vs. steady-state accuracy.
Q2: What is the key improvement of NLMS over LMS? What is the cost?
Answer
NLMS normalizes the step size at each step by the energy of the input vector $\|\mathbf{x}[n]\|^2$, so that the effective step size automatically adapts to input power — shrinking when power is high (preventing divergence) and growing when power is low (accelerating convergence). The cost is minimal: one extra inner product per step ($O(M)$, same order as LMS), plus a small regularization constant $\delta$ to prevent division by zero. NLMS is far more practical than LMS and is the preferred choice for industrial applications.
Interactive: LMS Noise Cancellation Experiment
A speech signal contaminated by noise; the LMS filter attempts to cancel the noise. Adjust the step size $\mu$ and filter length $M$ to observe convergence speed and noise reduction performance.
Top: gray = noisy signal $d[n]$, blue = LMS output $e[n]$ (after noise cancellation). Bottom: learning curve (MSE vs. iterations). Try making $\mu$ too large to see divergence!
References
- [1] S. Haykin, Adaptive Filter Theory, 5th ed., Prentice Hall, 2014.
- [2] B. Widrow & M. E. Hoff, "Adaptive switching circuits," IRE WESCON Conv. Rec., 1960.
- [3] B. Widrow & S. D. Stearns, Adaptive Signal Processing, Prentice Hall, 1985.
- [4] A. H. Sayed, Adaptive Filters, Wiley, 2008.
5.1 Decimation & Interpolation
Two fundamental tools for changing the sampling rate — but carelessness creates ghosts
Learning Objectives
- Understand how decimation causes spectral compression and aliasing in the frequency domain
- Understand why interpolation produces spectral images in the frequency domain
- Master the design principles for anti-aliasing and anti-imaging low-pass filters
- Implement arbitrary sample-rate conversion using rational ratio $L/M$
One-Sentence Summary
Decimation keeps only every $M$th sample, stretching the spectrum $M$-fold and superimposing $M$ shifted copies; interpolation inserts $L-1$ zeros between each pair of samples, narrowing the spectrum to $\pi/L$ but creating $L-1$ spectral images. Both require a low-pass filter for cleanup.
Why Learn This?
Ronald Crochiere and Lawrence Rabiner systematized the theoretical framework of multirate processing in their 1981 Proceedings of the IEEE tutorial review and their classic 1983 book Multirate Digital Signal Processing. Their motivation was practical: digital telephone systems used different sampling rates for different standards (8 kHz, 16 kHz, 44.1 kHz, 48 kHz), and devices needed seamless conversion. Before this, converting sampling rates almost always required D/A followed by A/D — poor quality and high cost. Multirate theory demonstrated that sample-rate conversion can be performed entirely in the digital domain, fundamentally transforming the communications and audio industries.
Previously...
Before entering multirate processing, review these key concepts:
- Sampling Theorem: The sampling rate $f_s$ must be greater than twice the signal's maximum frequency $f_{\max}$; otherwise aliasing occurs. The spectrum repeats with period $f_s$
- DTFT (Discrete-Time Fourier Transform): The spectrum of a discrete signal is a $2\pi$-periodic function, with $\omega = \pi$ corresponding to $f_s/2$ (Nyquist frequency)
- Low-Pass Filter (LPF): Passes only components below the cutoff frequency $\omega_c$ and suppresses high-frequency content
Pain Point: Why Can't You Simply Discard or Insert Samples?
You might think lowering the sampling rate is simple — just keep every few samples, right? Raising the sampling rate — just insert zeros in between, right?
- Directly discarding samples → aliasing: Naively taking every 5th sample of CD-quality music (44.1 kHz) to get 8.82 kHz causes all frequency components above 4.41 kHz to fold back into low frequencies, producing harsh distortion
- Directly inserting zeros → imaging: After inserting zeros between samples, multiple "mirror" copies of the original spectrum appear, sounding like metallic high-frequency noise
- Non-integer ratios → more complex: Converting from CD (44.1 kHz) to DAT (48 kHz) has a ratio of $160/147$ — neither integer upsampling nor integer downsampling
Core Lesson: Changing the sampling rate = changing the "scaling" and "periodicity" of the spectrum. Without filters to manage these changes, artifacts are inevitable.
Origin
The need for multirate processing arose from the development of digital telephone switching systems in the 1970s. ITU-T standards specified 8 kHz for telephone speech, 16 kHz for wideband speech, and 44.1 or 48 kHz for music. The initial solution was: digital → analog → resample → digital, which was expensive and accumulated noise.
Research by Crochiere and Rabiner at Bell Labs demonstrated that through the combination of integer upsampling + low-pass filtering + integer downsampling, any rational ratio $L/M$ sample-rate conversion can be performed entirely in the digital domain, and is theoretically lossless.
Core Concepts
Decimation by $M$: Intuition
Imagine you shot a 120fps slow-motion video and now want to convert it to 30fps. You keep every 4th frame. If the original video contains fast hand-waving (high frequency), 30fps may not represent it correctly — hand positions will jump or overlap (aliasing). So before discarding frames, you need to apply "motion blur" (low-pass filtering).
Time-domain operation:
Decimation
$$x_d[n] = x[nM]$$Keep every $M$th sample; the sampling rate drops from $f_s$ to $f_s/M$
Frequency-domain effect:
The frequency axis is stretched $M$-fold (content in $[-\pi/M, \pi/M]$ expands to fill $[-\pi, \pi]$), with $M$ shifted copies superimposed → aliasing occurs if the original spectrum extends beyond $[-\pi/M, \pi/M]$
Anti-aliasing strategy: Before decimation, apply a low-pass filter with cutoff $\omega_c = \pi/M$ to remove components above $f_s/(2M)$. This prevents spectral overlap after compression.
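A sketch of the difference, using SciPy's decimate (which applies the anti-aliasing filter for you; the 1 kHz / 2.9 kHz tone pair is an illustrative choice): at $f_s = 16$ kHz and $M = 4$, the 2.9 kHz tone should fold to $4 - 2.9 = 1.1$ kHz when no filter is used.

```python
import numpy as np
from scipy import signal

fs, M = 16000, 4
t = np.arange(0, 1.0, 1 / fs)
# 1 kHz survives decimation to fs/M = 4 kHz; 2.9 kHz exceeds the new
# Nyquist (2 kHz) and aliases to 4.0 - 2.9 = 1.1 kHz.
x = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 2900 * t)

naive = x[::M]                                  # just discard samples
clean = signal.decimate(x, M, ftype='fir')      # LPF at pi/M, then discard

def spectrum(y):
    return np.abs(np.fft.rfft(y * np.hanning(len(y))))

Yn, Yc = spectrum(naive), spectrum(clean)       # 1 Hz bins (4000 pts @ 4 kHz)
print(f"naive   : 1.1 kHz / 1 kHz ratio = {Yn[1100] / Yn[1000]:.3f}")
print(f"filtered: 1.1 kHz / 1 kHz ratio = {Yc[1100] / Yc[1000]:.5f}")
```

In the naive version the aliased 1.1 kHz component is as strong as the real 1 kHz tone; after the anti-aliasing filter it is pushed down by the filter's stopband attenuation.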
Interpolation by $L$: Intuition
Imagine you have a 100x100 low-resolution image and want to enlarge it to 400x400. The crudest method is to insert zeros (black dots) between pixels, then smooth with a blur filter — this is exactly the principle of upsampling.
Time-domain operation:
Upsampling (Zero Insertion)
$$x_u[n] = \begin{cases} x[n/L], & n = 0, \pm L, \pm 2L, \ldots \\ 0, & \text{otherwise} \end{cases}$$Insert $L-1$ zeros between each pair of original samples; the sampling rate increases from $f_s$ to $Lf_s$
Frequency-domain effect:
The spectrum is "compressed" into $[0, \pi/L]$, but $L-1$ additional image copies appear over $[0, 2\pi)$
Anti-imaging strategy: After upsampling, apply a low-pass filter with cutoff $\omega_c = \pi/L$ and gain $L$ to remove images while restoring the interpolated amplitude to the correct level.
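A sketch of zero-insertion plus anti-imaging filtering (SciPy's firwin designs the low-pass; the tone frequency and the 129-tap length are illustrative choices):

```python
import numpy as np
from scipy import signal

L = 4
n = np.arange(4000)
x = np.sin(2 * np.pi * 0.05 * n)        # tone at 0.05 cycles/sample

xu = np.zeros(L * len(x))
xu[::L] = x                              # insert L-1 zeros between samples

# Anti-imaging LPF: cutoff pi/L (0.25 in Nyquist units), gain L
h = L * signal.firwin(129, 1.0 / L)
y = signal.lfilter(h, 1.0, xu)

# At the high rate the tone sits at 0.05/L = 0.0125 cycles/sample (bin 200
# of a 16000-point FFT); the first image sits at 0.25 - 0.0125 = 0.2375
# (bin 3800). The filter keeps the former and removes the latter.
Xu = np.abs(np.fft.rfft(xu * np.hanning(len(xu))))
Y = np.abs(np.fft.rfft(y * np.hanning(len(y))))
print(f"before filter: image/baseband = {Xu[3800] / Xu[200]:.3f}")
print(f"after filter : image/baseband = {Y[3800] / Y[200]:.5f}")
```

The gain-$L$ factor in the filter restores the interpolated waveform to the original amplitude, compensating for the energy spread caused by zero insertion.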
Rational-Ratio Sample-Rate Conversion $L/M$
Sample-Rate Conversion Flow
$$x[n] \xrightarrow{\uparrow L} \xrightarrow{H(\omega_c = \pi/\max(L,M))} \xrightarrow{\downarrow M} y[n]$$First upsample by $L$ → low-pass filter (cutoff = smaller of $\pi/L$ and $\pi/M$) → downsample by $M$
Expand: Derivation of the decimation frequency-domain formula
Let $x_d[n] = x[nM]$; its DTFT is:
$$X_d(e^{j\omega}) = \sum_{n=-\infty}^{\infty} x[nM] e^{-j\omega n}$$Using the identity: for any sequence $x[n]$, we can write
$$\sum_{n} x[nM] e^{-j\omega n} = \frac{1}{M}\sum_{k=0}^{M-1}\sum_{m} x[m] e^{-j(\omega - 2\pi k)m/M}$$This is because "keeping every $M$th sample" is equivalent to first multiplying by a comb function with period $M$:
$$c[n] = \frac{1}{M}\sum_{k=0}^{M-1} e^{j2\pi kn/M} = \begin{cases}1, & n \equiv 0 \pmod{M}\\0, & \text{otherwise}\end{cases}$$Therefore $x[n] \cdot c[n]$ retains the samples at $n = 0, M, 2M, \ldots$, and its DTFT is:
$$\frac{1}{M}\sum_{k=0}^{M-1} X\!\left(e^{j(\omega - 2\pi k)/M}\right)$$This clearly shows: the spectrum after decimation is the sum of $M$ shifted copies of the original spectrum, each stretched $M$-fold in frequency and scaled by $1/M$ in amplitude. When these copies overlap, aliasing occurs.
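The aliasing-sum formula can be verified numerically on a DFT grid — a sketch with $N$ chosen divisible by $M$ so the two frequency grids align exactly:

```python
import numpy as np

# Verify X_d(e^{jw}) = (1/M) * sum_k X(e^{j(w - 2*pi*k)/M}) at the DFT
# frequencies w = 2*pi*m/(N/M) of the decimated signal.
M, N = 3, 3000
rng = np.random.default_rng(0)
x = rng.standard_normal(N)

lhs = np.fft.fft(x[::M])                  # DTFT of x_d[n] = x[nM] on its grid

X = np.fft.fft(x)                         # DTFT of x on the length-N grid
m = np.arange(N // M)
# (w - 2*pi*k)/M maps to length-N DFT index (m - k*N/M) mod N
rhs = sum(X[(m - k * (N // M)) % N] for k in range(M)) / M

print(np.max(np.abs(lhs - rhs)))          # round-off only
```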
How to Use: Sample-Rate Conversion in Practice
Example: CD 44.1 kHz → Telephone 8 kHz
- Compute the ratio: $8000/44100 = 80/441$. So $L = 80$, $M = 441$ (already in lowest terms)
- Upsample by 80: Insert 79 zeros between each pair of samples → intermediate rate = $44100 \times 80 = 3{,}528{,}000$ Hz
- Low-pass filter: Cutoff $\omega_c = \pi/\max(80, 441) = \pi/441$, corresponding to $f_c = 3{,}528{,}000/(2 \times 441) = 4{,}000$ Hz
- Downsample by 441: Keep every 441st sample → final rate = $3{,}528{,}000/441 = 8{,}000$ Hz
Practical Note: The intermediate rate of 3.528 MHz is excessively high; in practice, Polyphase structures (next section) or multistage cascades are used to avoid this problem.
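In practice the whole chain is one library call: SciPy's resample_poly performs $\uparrow L \to \text{LPF} \to \downarrow M$ with a polyphase implementation (next section), so the 3.528 MHz intermediate rate never materializes. The 1 kHz test tone below is an illustrative choice:

```python
import numpy as np
from scipy import signal

fs_in, fs_out = 44100, 8000
t = np.arange(0, 1.0, 1 / fs_in)
x = np.sin(2 * np.pi * 1000 * t)          # 1 kHz tone at CD rate

y = signal.resample_poly(x, 80, 441)      # L = 80, M = 441
Y = np.abs(np.fft.rfft(y * np.hanning(len(y))))

print(len(x), "->", len(y))               # 44100 -> 8000 samples
print("peak at", np.argmax(Y), "Hz")      # tone still at 1000 Hz (1 Hz bins)
```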
Interactive: Frequency-Domain Effects of Upsampling and Downsampling
Select decimation or interpolation mode, adjust the factor, toggle the anti-aliasing/anti-imaging filter, and observe changes in the time-domain waveform and spectrum.
Applications
- Audio Sample-Rate Conversion: CD (44.1 kHz) ↔ DVD/Blu-ray (48 kHz) ↔ high-resolution audio (96/192 kHz). All DAW software (Pro Tools, Ableton) includes built-in multirate converters
- Digital Communications: Baseband signals use different rates at different processing stages — modulator (high-rate), equalizer (mid-rate), speech encoder (low-rate). Software-defined radio (SDR) makes extensive use of multistage decimation
- Image Scaling: Image magnification/reduction is essentially 2D upsampling/downsampling. Photoshop's "Bicubic" interpolation is a 2D low-pass interpolation filter
- Biomedical Signals: EEG is commonly sampled at 512 Hz or 1024 Hz, but many analyses focus only on 0-40 Hz brainwaves → decimation to 128 Hz greatly reduces computation
Pitfalls and Limitations
- Forgetting to filter before downsampling: This is the most common mistake. Once aliasing occurs, it is irreversible — the folded high-frequency and low-frequency components are permanently mixed together
- Upsampling does not "add information": Upsampling 8 kHz speech to 44.1 kHz will not make it sound clearer. You cannot recover frequency components above 4 kHz — that information was lost during the original sampling
- Non-integer ratio precision issues: The ratio 44100 → 48000 = $160/147$ requires upsampling by 160 then downsampling by 147, with an intermediate rate of 7.056 MHz. Multistage or Polyphase structures are essential in practice
- Group Delay: The FIR anti-aliasing filter (M4B Filter Design) introduces a delay of $(N-1)/2$ samples. In real-time applications, this delay may be unacceptable
Quick Check
Q1: A speech signal sampled at 16 kHz is decimated by a factor of 4. Without an anti-aliasing filter, at what frequency will the original 3 kHz component appear after decimation?
Show answer
The post-decimation sampling rate is $16/4 = 4$ kHz, with a Nyquist frequency of 2 kHz. The original 3 kHz component exceeds Nyquist and folds: $3 \text{ kHz}$ maps to $4 - 3 = 1$ kHz. Therefore, a spurious component appears at 1 kHz in the decimated signal — this is aliasing.
Q2: Why does a sample-rate conversion system use "upsample first, then downsample" ($\uparrow L \to \text{LPF} \to \downarrow M$) rather than "downsample first, then upsample"?
Show answer
If you downsample first, high-frequency components are irreversibly lost due to aliasing. Upsampling first only inserts zeros (producing removable images) without losing any information. After upsampling, a low-pass filter simultaneously removes the images and the components that would alias during the subsequent downsampling. This guarantees mathematically lossless conversion (assuming an ideal low-pass filter).
References
- [1] R. E. Crochiere & L. R. Rabiner, Multirate Digital Signal Processing, Prentice-Hall, 1983.
- [2] A. V. Oppenheim & R. W. Schafer, Discrete-Time Signal Processing, 3rd ed., Ch. 4.
- [3] P. P. Vaidyanathan, Multirate Systems and Filter Banks, Prentice-Hall, 1993.
- [4] R. Lyons, Understanding Digital Signal Processing, 3rd ed., Ch. 10.
5.2 Polyphase Decomposition
Don't compute then discard — compute only what you need
Learning Objectives
- Understand the computational waste in direct decimation filtering
- Master the mathematical derivation of polyphase decomposition and the Noble Identity
- Compute the savings that the polyphase structure provides
- Understand the dual application of polyphase for interpolation
One-Sentence Summary
The polyphase structure splits a long filter into $M$ short sub-filters, each operating at the lower rate, avoiding the waste of "compute then discard" and achieving a direct $M$-fold reduction in computation.
Why Learn This?
Bellanger, Bonnerot & Coudreuse (1976) first systematically proposed the polyphase filter structure for multirate processing in an IEEE paper. The core insight of this idea transformed the entire communications industry: in a digital receiver, if you need to low-pass filter before decimation, the conventional approach executes all multiply-accumulate operations at the high rate, then discards most of the results. Polyphase rearranges this process, computing only the output samples you actually need. In an era when DSP chip resources were limited (and still are), this $M$-fold speedup is decisive.
Previously...
- Decimation flow (previous section): First filter with a LPF with cutoff $\pi/M$, then keep every $M$th sample
- FIR convolution: $y[n] = \sum_{k=0}^{N-1} h[k]\, x[n-k]$, requiring $N$ multiply-accumulates per output
- Z-transform: $H(z) = \sum_{k} h[k] z^{-k}$, used to analyze the algebraic structure of the filter
Pain Point: Massive Computational Waste
Consider a decimation-by-$M = 4$ system:
- First apply FIR filtering to every input sample ($N$ multiply-accumulates per sample)
- Then keep only 1 out of every 4 outputs, discarding the other 3
- In other words, 75% of the computed results are directly thrown away!
If the filter has $N = 128$ taps and the input rate is 1 MHz:
- Direct method: $128 \times 10^6 = 1.28 \times 10^8$ multiplications/second
- Of which $3/4$ are wasted → only $3.2 \times 10^7$ multiplications/second are useful
Question: Can we rearrange the computation order so the filter operates only at the low rate ($f_s/M$), fundamentally eliminating the waste?
Origin
The name "polyphase" comes from "multiple phases" — grouping filter coefficients by their phase (delay offset). Conceptually, it is similar to a polyphase power system, where three phases are offset by 120 degrees and combine to deliver stable power.
The key mathematical breakthrough is the Noble Identity: it proves that a downsampler can "pass through" a filter, provided that the appropriate variable substitution is made in the filter's Z-transform. This allows us to downsample first, then filter at the lower rate.
Core Concepts
Intuition: Imagine you are a quality inspector on a factory assembly line, checking only 1 out of every 4 products. The traditional approach is "inspect every product fully, then discard 3 out of 4 reports." The polyphase approach is "only inspect the 4th product, but incorporate key data from the previous 3" — the number of inspections drops by 4x, yet the results are identical.
Polyphase Decomposition
Split the $N$-tap filter's coefficients into $M$ groups by residue:
Type-I Polyphase Decomposition
$$H(z) = \sum_{k=0}^{M-1} z^{-k}\, E_k(z^M)$$where $E_k(z) = \sum_{n} h[nM+k]\, z^{-n}$, $k = 0, 1, \ldots, M-1$
Each $E_k$ contains only every $M$th coefficient of the original filter:
- $E_0$: $h[0], h[M], h[2M], \ldots$ (phase 0)
- $E_1$: $h[1], h[M+1], h[2M+1], \ldots$ (phase 1)
- $E_{M-1}$: $h[M-1], h[2M-1], h[3M-1], \ldots$ (phase $M-1$)
Noble Identity
The downsampler can "pass through" the filter, provided $z$ is replaced by $z^M$
Applying the Noble Identity to a decimation system:
- Original: $x[n] \to H(z) \to \downarrow M \to y[n]$ (filtering at the high rate)
- Decompose: $H(z) = \sum z^{-k} E_k(z^M)$
- Apply the Noble Identity to each branch: downsample first, then filter with $E_k(z)$
- Result: all $M$ sub-filters operate at the low rate!
Computation Analysis
| | Direct Method | Polyphase |
|---|---|---|
| Multiplications per output | $N$ | $N$ |
| Output rate | $f_s$ (produced at high rate, then discarded) | $f_s/M$ (produce only what's needed) |
| Multiplications per second | $N \cdot f_s$ | $N \cdot f_s / M$ |
| Speedup | — | $M\times$ |
Expand: Full derivation of polyphase decomposition
Starting from the Z-transform:
$$H(z) = \sum_{n=0}^{N-1} h[n] z^{-n}$$Write the index $n$ as $n = qM + k$, where $q = \lfloor n/M \rfloor$ and $k = n \bmod M$:
$$H(z) = \sum_{k=0}^{M-1} \sum_{q=0}^{\lceil N/M\rceil - 1} h[qM+k]\, z^{-(qM+k)}$$ $$= \sum_{k=0}^{M-1} z^{-k} \underbrace{\sum_{q} h[qM+k]\,(z^M)^{-q}}_{E_k(z^M)}$$Therefore $H(z) = \sum_{k=0}^{M-1} z^{-k} E_k(z^M)$, where each polyphase sub-filter is:
$$E_k(z) = \sum_{q} h[qM+k]\, z^{-q}, \quad k = 0, 1, \ldots, M-1$$Proof of the Noble Identity:
Let $v[n] = \sum_m g[m] w[n-m]$, where $g$ has Z-transform $G(z^M)$ (i.e., $g[n] \neq 0$ only when $n$ is a multiple of $M$). After downsampling:
$$v_d[n] = v[nM] = \sum_m g[m] w[nM - m]$$Since $g[m] = 0$ when $m \not\equiv 0 \pmod{M}$, let $m = lM$:
$$v_d[n] = \sum_l g[lM]\, w[nM - lM] = \sum_l g[lM]\, w_d[n-l], \qquad \text{where } w_d[n] = w[nM]$$This is exactly $w$ downsampled first, then filtered by $G(z)$. QED.
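The identity is easy to confirm numerically. A minimal NumPy sketch (the coefficients and signal are arbitrary, made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
M = 4
g = rng.standard_normal(6)             # coefficients of G(z)
x = rng.standard_normal(200)

# Coefficients of G(z^M): insert M-1 zeros between the taps of g
g_up = np.zeros((len(g) - 1) * M + 1)
g_up[::M] = g

left = np.convolve(x, g_up)[::M]       # filter by G(z^M), then downsample by M
right = np.convolve(x[::M], g)         # downsample by M, then filter by G(z)
n = min(len(left), len(right))
print(np.allclose(left[:n], right[:n]))  # True: the two orderings agree
```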
How to Use: Polyphase Decimation Worked Example
Example: 24-tap FIR, Decimation by $M = 4$
- Group: Split $h[0] \ldots h[23]$ into 4 groups:
- $E_0$: $h[0], h[4], h[8], h[12], h[16], h[20]$ (6 coefficients)
- $E_1$: $h[1], h[5], h[9], h[13], h[17], h[21]$ (6 coefficients)
- $E_2$: $h[2], h[6], h[10], h[14], h[18], h[22]$ (6 coefficients)
- $E_3$: $h[3], h[7], h[11], h[15], h[19], h[23]$ (6 coefficients)
- Downsample input: Downsample each phase of the original input separately → each $E_k$ input rate = $f_s/4$
- Sub-filter: Each $E_k$ filters with 6 coefficients → 6 multiplications
- Combine: Sum 4 branches → each output uses $4 \times 6 = 24$ multiplications, but at $f_s/4$ rate
- Result:
Direct: 24 mults × 1000 samples/sec = 24,000 mults/sec
Polyphase: 24 mults × 250 samples/sec = 6,000 mults/sec
Speedup: 4x!
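The worked example can be checked in code. Below is a minimal polyphase decimator sketch (NumPy; the function name and test signal are my own) that reproduces the direct "filter, then keep 1 of $M$" output exactly:

```python
import numpy as np

def polyphase_decimate(x, h, M):
    """Decimate x by M via the Type-I polyphase split of h."""
    h = np.concatenate([h, np.zeros((-len(h)) % M)])  # pad to a multiple of M
    Nout = (len(x) + len(h) - 1 + M - 1) // M         # number of kept outputs
    y = np.zeros(Nout)
    for k in range(M):
        e_k = h[k::M]                                 # sub-filter E_k: h[k], h[M+k], ...
        # Branch k sees the input phase x[mM - k]: a leading zero for k > 0,
        # then every M-th input sample starting at index M - k
        x_k = np.concatenate([np.zeros(1 if k else 0), x[(M - k) % M::M]])
        b = np.convolve(x_k, e_k)                     # runs at the LOW rate
        n = min(len(b), Nout)
        y[:n] += b[:n]
    return y

rng = np.random.default_rng(1)
h = rng.standard_normal(24)                           # 24-tap FIR
x = rng.standard_normal(1000)
ref = np.convolve(x, h)[::4]                          # direct: filter, keep 1 of 4
out = polyphase_decimate(x, h, 4)
n = min(len(out), len(ref))
print(np.allclose(out[:n], ref[:n]))                  # True
```

Each branch convolution runs at $f_s/4$ with 6 coefficients, matching the 4x saving computed above.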
Interactive: Direct Method vs. Polyphase Computation Comparison
Adjust the filter length and decimation factor to observe the computational difference.
Applications
- Software-Defined Radio (SDR): Receiver front-end ADCs sample at several GHz and require multistage decimation to baseband. Every stage uses polyphase structures; otherwise the FPGA simply cannot keep up
- Digital Television (DVB): The channelizer simultaneously extracts multiple narrowband channels from a wideband signal and is essentially a polyphase filter bank
- Audio Resampling: Open-source libraries like libsamplerate and SoX internally use polyphase for high-quality sample-rate conversion
- Radar Pulse Compression: The matched filter + decimation combination uses polyphase for real-time processing of high-speed ADC data on FPGAs
Pitfalls and Limitations
- Only works for integer factors: The polyphase structure requires the decimation factor $M$ to be an integer. For rational ratios $L/M$, upsampling and downsampling polyphase structures must be combined
- Direct decomposition applies to FIR: the Noble Identities hold for any rational transfer function in $z^M$, but the clean coefficient-splitting above is specific to FIR filters; the recursive structure of IIR makes polyphase decomposition considerably more complex
- Filter length must be divisible by $M$: If not, zero-pad to a multiple of $M$ (does not affect frequency response, just adds a few zero coefficients)
- Memory access patterns: In hardware implementations, the non-contiguous memory access pattern of polyphase may reduce cache efficiency; careful data layout is needed
Quick Check
Q1: A 64-tap FIR filter is used in an 8x decimation system. How many multiplications per second does each method require? (Assume input rate 10 kHz)
Show answer
Direct: Filter at 10 kHz → $64 \times 10{,}000 = 640{,}000$ multiplications/sec, then discard 7 out of 8 outputs.
Polyphase: 8 sub-filters with $64/8 = 8$ coefficients each, operating at $10{,}000/8 = 1{,}250$ Hz. Each output requires $8 \times 8 = 64$ multiplications, but at a rate of only 1,250 Hz → $64 \times 1{,}250 = 80{,}000$ multiplications/sec. Speedup: $640{,}000/80{,}000 = 8$x, exactly equal to $M$.
Q2: The Noble Identity says "a downsampler can pass through a filter," but why doesn't this work for arbitrary filters? What is the key condition?
Show answer
The precise form of the Noble Identity is: $H(z^M)$ followed by $\downarrow M$ = $\downarrow M$ followed by $H(z)$. Note that the filter on the left is $H(z^M)$ (not $H(z)$). So not just any $H(z)$ can directly pass through the downsampler — you must first decompose $H(z)$ into polyphase form $\sum z^{-k} E_k(z^M)$, then apply the Noble Identity to each $E_k(z^M)$ term separately.
References: [1] Bellanger, M., Bonnerot, G. & Coudreuse, M., Digital Filtering by Polyphase Network: Application to Sample-Rate Alteration and Filter Banks, IEEE Trans. ASSP, 1976. [2] Vaidyanathan, P.P., Multirate Systems and Filter Banks, Ch.4-5, 1993. [3] harris, f.j., Multirate Signal Processing for Communication Systems, Prentice-Hall, 2004. [4] Crochiere & Rabiner, Multirate Digital Signal Processing, Ch.3, 1983.
5.3 Filter Banks
Split a signal into subbands, process each independently, and reassemble — a unified framework from MP3 to wavelets
Learning Objectives
- Understand the analysis-synthesis architecture of two-channel QMF (Quadrature Mirror Filter)
- Master the mathematical derivation of the Perfect Reconstruction (PR) condition
- Understand the Alias Cancellation condition
- Establish the connection between filter banks and the Discrete Wavelet Transform (DWT): the Mallat algorithm
One-Sentence Summary
A filter bank splits a signal into multiple subbands, decimates each for independent processing, then upsamples and synthesizes back to the original signal. If designed properly, Perfect Reconstruction (PR) is achievable — the output is exactly a delayed version of the input with no distortion whatsoever.
Why does this matter
Esteban & Galand (1977) proposed the Quadrature Mirror Filter (QMF) for sub-band speech coding at the ICASSP conference, inaugurating the era of subband processing. Their motivation: different frequency bands of speech have different perceptual importance; by processing them separately, more bits can be allocated to important bands and fewer to unimportant ones — the seed of Perceptual Coding. The MDCT (Modified Discrete Cosine Transform) used in MP3 is essentially a perfect reconstruction filter bank, and the DWT used in JPEG 2000 is as well. It is fair to say that half of modern compression technology is built on filter bank theory.
Previously...
- Decimation and interpolation (Section 5.1): Frequency-domain effects of decimation by 2 — spectral compression + aliasing
- Polyphase structure (Section 5.2): How to efficiently implement the "filter + decimate" combination
- Low-pass/high-pass filters: $H_0(z)$ low-pass retains $[0, \pi/2]$, $H_1(z)$ high-pass retains $[\pi/2, \pi]$
Pain Point: The Dilemma of Subband Processing
Many applications require frequency-band-specific processing:
- Audio compression: The ear is most sensitive to 1-4 kHz and insensitive to 16+ kHz → encode bands separately
- Noise removal: Noise may exist only in certain bands → process only those bands
- Equalizer: Independently adjust bass, mid, and treble
But if the bands need to be reassembled afterward, a problem arises:
Core Challenge: Decimation produces aliasing; upsampling produces imaging. During the analysis-synthesis round-trip, will these artifacts accumulate? Is there a way to make them cancel completely?
Origin
The QMF story: Esteban and Galand's original QMF design could only achieve "approximate" perfect reconstruction — aliasing could be completely eliminated, but some amplitude distortion remained.
Smith & Barnwell (1984) and Mintzer (1985) independently discovered true perfect reconstruction two-channel filter banks (PR-QMF), also known as Conjugate Quadrature Filters (CQF).
Stephane Mallat (1989) revealed a profound connection: repeatedly applying two-channel analysis to the low-pass output = the Discrete Wavelet Transform (DWT). This is the Mallat algorithm, which transformed abstract wavelet theory into efficiently computable filter bank operations, directly catalyzing the explosion of wavelets in image compression, denoising, and beyond.
Core Concepts
Intuition: Imagine traffic on a highway (the signal) reaching a toll station where it splits into two lanes (low-frequency/high-frequency). Each lane is tolled independently (processed), then merges back onto the same road. If the splitting and merging mechanisms are well designed, the exiting traffic is identical to the entering traffic (perfect reconstruction). If poorly designed, some cars get lost or duplicated (distortion).
Two-Channel Analysis-Synthesis System
The structure is as follows:
Analysis-synthesis equation:
Z-Transform of the Output
$$Y(z) = \underbrace{\tfrac{1}{2}\bigl[F_0(z)H_0(z) + F_1(z)H_1(z)\bigr]}_{\text{Transfer function } T(z)} X(z) + \underbrace{\tfrac{1}{2}\bigl[F_0(z)H_0(-z) + F_1(z)H_1(-z)\bigr]}_{\text{Alias term } A(z)} X(-z)$$Two design objectives:
- Alias Cancellation: Set $A(z) = 0$, i.e., $F_0(z)H_0(-z) + F_1(z)H_1(-z) = 0$.
  A simple solution: $F_0(z) = H_1(-z)$, $F_1(z) = -H_0(-z)$
- Perfect Reconstruction: Set $T(z) = cz^{-d}$ (pure delay), i.e., $$F_0(z)H_0(z) + F_1(z)H_1(z) = 2cz^{-d}$$
QMF design
The classical QMF choice is $H_1(z) = H_0(-z)$ (mirror relation), i.e. $h_1[n] = (-1)^n h_0[n]$.
Meaning: if $H_0$ is a low-pass, then $H_0(-z)$ flips the frequency axis and automatically becomes a high-pass. This is the origin of the name "Quadrature Mirror"—the high-pass is the mirror image of the low-pass.
Link to the DWT: the Mallat algorithm
Feed the low-pass output $v_0[n]$ of a two-channel analysis back into another $(H_0, H_1)$ + $\downarrow 2$ stage:
This is exactly the Discrete Wavelet Transform (DWT)! Each level extracts details at a different scale. $H_0/H_1$ are the filters associated with the wavelet's scaling function / mother wavelet.
Show full derivation: the perfect-reconstruction condition
In a two-channel system, the combined effect of $\downarrow 2$ followed by $\uparrow 2$ is:
$$(\uparrow 2 \circ \downarrow 2)\{v\} \;\longleftrightarrow\; \tfrac{1}{2}\bigl[V(z) + V(-z)\bigr]$$More explicitly, let $v_0[n]$ be the result of filtering $x[n]$ with $H_0$ and then $\downarrow 2$:
$$V_0(z) = \frac{1}{2}\bigl[H_0(z^{1/2})X(z^{1/2}) + H_0(-z^{1/2})X(-z^{1/2})\bigr]$$(using the frequency-domain formula for $\downarrow 2$). Upsample again and pass through $F_0$:
$$U_0(z) = F_0(z) \cdot \frac{1}{2}\bigl[H_0(z)X(z) + H_0(-z)X(-z)\bigr]$$Similarly $U_1(z) = F_1(z) \cdot \frac{1}{2}[H_1(z)X(z) + H_1(-z)X(-z)]$.
Synthesis output:
$$Y(z) = U_0(z) + U_1(z) = T(z)X(z) + A(z)X(-z)$$where:
$$T(z) = \frac{1}{2}[F_0(z)H_0(z) + F_1(z)H_1(z)]$$ $$A(z) = \frac{1}{2}[F_0(z)H_0(-z) + F_1(z)H_1(-z)]$$Perfect reconstruction requires: $A(z) = 0$ (no aliasing) and $T(z) = cz^{-d}$ (pure delay).
If we take $H_1(z) = H_0(-z)$, $F_0(z) = H_0(z)$, $F_1(z) = -H_0(-z)$, then:
$$A(z) = \frac{1}{2}[H_0(z)H_0(-z) - H_0(-z)H_0(z)] = 0 \checkmark$$ $$T(z) = \frac{1}{2}[H_0^2(z) - H_0^2(-z)]$$PR requires $T(z) = cz^{-d}$. With this strict QMF choice no nontrivial FIR low-pass achieves it exactly (only the 2-tap Haar filter does), which is why Smith-Barnwell and Mintzer replaced the mirror relation with the CQF choice $h_1[n] = (-1)^n h_0[N-1-n]$; the PR condition then becomes power complementarity, $|H_0(e^{j\omega})|^2 + |H_0(e^{j(\omega+\pi)})|^2 = 2$. Daubechies wavelets are famous solutions that satisfy this condition.
How to Use: designing a two-channel filter bank
- Choose a prototype low-pass filter $H_0(z)$: design a half-band FIR low-pass with cutoff at $\pi/2$, e.g. using Daubechies coefficients or a CQF design.
- Derive the high-pass from the low-pass: the classical QMF mirror is $H_1(z) = H_0(-z)$, i.e. $h_1[n] = (-1)^n h_0[n]$; for true PR use the CQF relation $h_1[n] = (-1)^n h_0[N-1-n]$ (mirror plus time reversal).
- Set the synthesis filters: $F_0(z) = H_0(z)$, $F_1(z) = -H_1(z)$ in the QMF case, or time-reversed analysis filters in the orthogonal CQF case (adjust according to the PR condition).
- Verify PR: compute $T(z)$ to confirm it is a pure delay and compute $A(z)$ to confirm it is zero.
- (For a DWT) iterate: feed $v_0[n]$ back into the same analysis filters and repeat for the desired number of levels.
Numerical example: Daubechies db2 (4-tap): $h_0 = [0.4830, 0.8365, 0.2241, -0.1294]$. With the CQF high-pass and time-reversed synthesis filters, you can verify that $T(z) = z^{-3}$ (3-sample delay) and the reconstruction is perfect.
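The db2 claim can be verified numerically. The sketch below uses the exact db2 coefficients, the CQF high-pass $h_1[n] = (-1)^n h_0[3-n]$, and time-reversed synthesis filters (one standard orthogonal-filter-bank convention; others differ by signs and delays):

```python
import numpy as np

# Exact db2 analysis low-pass (rounded in the text to [0.4830, 0.8365, 0.2241, -0.1294])
s3 = np.sqrt(3.0)
h0 = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2.0))
h1 = ((-1.0) ** np.arange(4)) * h0[::-1]      # CQF high-pass: h1[n] = (-1)^n h0[3-n]
f0, f1 = h0[::-1], h1[::-1]                   # synthesis = time-reversed analysis

rng = np.random.default_rng(2)
x = rng.standard_normal(128)

# Analysis: filter, then keep the even-indexed samples
v0 = np.convolve(x, h0)[::2]
v1 = np.convolve(x, h1)[::2]

# Synthesis: insert zeros, filter, sum the two branches
def up2(v):
    u = np.zeros(2 * len(v))
    u[::2] = v
    return u

y = np.convolve(up2(v0), f0) + np.convolve(up2(v1), f1)
print(np.allclose(y[3:3 + len(x)], x))        # True: y[n] = x[n-3], perfect reconstruction
```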
Interactive: two-channel filter bank analysis and reconstruction
See how the input signal is split into low- and high-frequency subbands, and compare perfect reconstruction against non-perfect reconstruction.
Applications
- Speech coding (G.722): the ITU-T G.722 standard uses a two-channel QMF to split 7 kHz speech into low and high bands, coded separately with ADPCM. The low band receives more bits (most speech energy) and the high band fewer.
- MP3 audio compression: uses a 32-band polyphase filter bank cascaded with an MDCT (Modified Discrete Cosine Transform), a hybrid, cosine-modulated filter bank. Each subband is allocated a different number of quantization bits based on a psychoacoustic model.
- JPEG 2000 image compression: uses CDF 9/7 or Le Gall 5/3 biorthogonal wavelet filter banks for multi-level 2D subband decomposition of images. Better suited than the DCT of JPEG for handling image detail at different resolutions.
- Acoustic echo cancellation (AEC): subband adaptive filters converge independently in each band, converging faster and using less computation than full-band algorithms.
Pitfalls and Limitations
- Classical QMF cannot achieve true PR: the original Esteban-Galand QMF can only cancel aliasing while leaving residual amplitude distortion. True PR requires CQF/PR-QMF designs.
- Linear phase vs. PR conflict: orthogonal filter banks (e.g. Daubechies) achieve PR but not linear phase. Biorthogonal filter banks can achieve both, but analysis and synthesis filters differ.
- Leakage between bands: an ideal brick-wall split is impossible with finite-length filters. There is always leakage in the transition band, affecting independent subband processing.
- Delay: a PR filter bank introduces an overall delay determined by the filter lengths (e.g. $d = N-1$ samples for a length-$N$ orthogonal design, and more for each additional tree level). In real-time communications this delay may exceed the acceptable range.
Quick Check
Q1: Why does $H_1(z) = H_0(-z)$ turn a low-pass into a high-pass? Explain in the frequency domain.
Show answer
$H_0(-z)$ is equivalent to replacing $\omega$ with $\omega + \pi$ in $H_0(e^{j\omega})$: $$H_1(e^{j\omega}) = H_0(e^{j(\omega+\pi)})$$ This shifts the frequency axis by $\pi$: the passband originally at $\omega = 0$ (low frequencies) moves to $\omega = \pi$ (high frequencies), and the stopband originally at $\omega = \pi$ moves to $\omega = 0$. So a low-pass becomes a high-pass, and the two frequency responses are mirror images — that is the meaning of "Quadrature Mirror."
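This mirror relation is a one-line FFT check: multiplying by $(-1)^n$ circularly shifts the spectrum by half the FFT length. A sketch with an arbitrary illustrative low-pass (a moving average, not a real QMF design):

```python
import numpy as np

h0 = np.ones(8) / 8                     # crude low-pass (moving average), illustration only
h1 = ((-1.0) ** np.arange(8)) * h0      # QMF mirror: h1[n] = (-1)^n h0[n]

N = 512
H0 = np.fft.fft(h0, N)                  # zero-padded frequency responses
H1 = np.fft.fft(h1, N)

# Modulation by (-1)^n = e^{j*pi*n} shifts the spectrum by pi, i.e. by N/2 bins
print(np.allclose(H1, np.roll(H0, N // 2)))   # True: H1(w) = H0(w + pi)
```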
Q2: With each extra level of decomposition in the Mallat algorithm, what happens to the sample rate and the bandwidth of the low-frequency subband?
Show answer
Each level is a $\downarrow 2$, so for every additional level: sample rate halves and bandwidth halves.
Level-$j$ low subband: sample rate = $f_s/2^j$, bandwidth = $[0, f_s/2^{j+1}]$.
For example with $f_s = 8$ kHz and 3 levels of decomposition:
Level 1 low band: $f_s/2 = 4$ kHz, bandwidth $[0, 2\text{ kHz}]$
Level 2 low band: $f_s/4 = 2$ kHz, bandwidth $[0, 1\text{ kHz}]$
Level 3 low band: $f_s/8 = 1$ kHz, bandwidth $[0, 0.5\text{ kHz}]$
References: [1] Esteban, D. & Galand, C., Application of Quadrature Mirror Filters to Split Band Voice Coding Schemes, ICASSP, 1977. [2] Smith, M.J.T. & Barnwell, T.P., Exact Reconstruction Techniques for Tree-Structured Subband Coders, IEEE Trans. ASSP, 1986. [3] Mallat, S., A Theory for Multiresolution Signal Decomposition, IEEE Trans. PAMI, 1989. [4] Vaidyanathan, P.P., Multirate Systems and Filter Banks, Ch.5-6, 1993. [5] Strang, G. & Nguyen, T., Wavelets and Filter Banks, Wellesley-Cambridge, 1996.
5.4 Sigma-Delta ADC (Oversampling & Noise Shaping)
1-bit quantizer + oversampling + noise shaping = the magic of 24-bit resolution
Learning Objectives
- Understand how oversampling spreads quantization noise, improving in-band SNR
- Master the feedback principle of noise shaping and the Noise Transfer Function (NTF)
- Compute the SNR improvement for different modulator orders
- Understand the complete Sigma-Delta ADC system: modulator → digital low-pass → decimation (5A)
One-Sentence Summary
A Sigma-Delta ($\Sigma\Delta$) ADC samples with the crudest quantizer in the world (a 1-bit comparator) at a very high rate, then uses a feedback loop to "push" the quantization noise out of the signal band, and finally recovers a high-resolution (16-24 bit) output with a digital filter and decimation. It trades the "precision" problem for a "speed" problem.
Why does this matter
Inose, Yasuda & Murakami (1962) first proposed Delta-Sigma modulation (originally an improvement to Delta Modulation) at the University of Tokyo. James Candy (1974) further developed the theoretical basis of oversampling ADCs at Bell Labs. However, Sigma-Delta ADCs did not really take off until CMOS VLSI processes matured in the 1990s—because they only need a comparator (1-bit quantizer) and some digital logic, and are implemented almost entirely in digital circuits, perfectly aligned with CMOS scaling trends.
Today, almost every consumer-electronics audio ADC/DAC is a Sigma-Delta architecture—your phone, headphones and sound cards all contain one. Understanding how it achieves 24-bit resolution from 1-bit + oversampling + noise shaping is one of the most elegant examples of trading speed for precision in DSP.
Previously
- Quantization noise: a $B$-bit uniform quantizer has SQNR = $6.02B + 1.76$ dB. A 1-bit quantizer gives SQNR $\approx 7.78$ dB (dreadful).
- Power spectral density (PSD): under the additive white-noise model, quantization noise is uniformly distributed over $[-f_s/2, f_s/2]$ with two-sided PSD = $\Delta^2/(12 f_s)$.
- Decimation + low-pass filtering (section 5.1): first filter out the out-of-band noise, then decimate to the target rate.
The Problem: the cost of high-resolution ADCs
Traditional Nyquist-rate ADCs (SAR, Pipeline) face severe challenges to reach high resolution:
- 16-bit ADC: the comparator must distinguish $V_{\text{ref}}/65536$ volts. With a 3.3 V reference that is $50\,\mu\text{V}$—smaller than on-chip thermal noise!
- Resistor/capacitor matching: pipeline ADCs need 0.001% component matching, leading to low yield and high cost.
- Power: high-speed, high-precision ADCs can dissipate several watts, unsuitable for mobile devices.
Fundamental tension: CMOS processes are good at making "fast but crude" digital circuits, not "slow but precise" analog ones. Is there a way to use a "fast but crude" 1-bit quantizer and still get a "slow but precise" result?
Origin
Evolution of Sigma-Delta:
- Delta Modulation (1946, Deloraine): a 1-bit quantizer encoding the signal's "difference." Problems: slope overload and granular noise.
- Delta-Sigma Modulation (1962, Inose et al.): an integrator placed before Delta Modulation. The integrator causes quantization noise to be differentiated (high-pass shaped), greatly reducing in-band noise.
- Higher-order modulators (1980s-90s): cascading multiple integrators; each extra order pushes noise further away. Stability becomes a key challenge.
Naming dispute: the original paper called it "Delta-Sigma" ($\Delta\Sigma$) because the difference (Delta) comes first and then the integration (Sigma). The industry generally calls it "Sigma-Delta" ($\Sigma\Delta$) because in the block diagram the integrator comes first. Both names refer to the same thing.
Core Concepts
Step 1: oversampling
Intuition: imagine sprinkling a fistful of sand (quantization noise) uniformly on a wall. The total amount of sand is fixed, but if you make the wall $R$ times larger (expanded bandwidth), the sand per square centimeter drops by a factor of $R$. If you only look at a small middle portion of the wall (the signal band), there is far less sand (noise).
SNR improvement from oversampling
$$\text{SNR}_{\text{oversampled}} = \text{SQNR} + 10\log_{10}(R) \quad \text{dB}$$$R = f_s / (2f_b)$ is the Oversampling Ratio (OSR); $f_b$ is the signal bandwidth.
Every doubling of the oversampling ratio reduces in-band noise by 3 dB, an effective gain of only 0.5 bit of resolution. That is very inefficient—256x oversampling only buys 4 bits.
Step 2: noise shaping
Intuition: what if, instead of just spreading the sand more uniformly, you used a broom to sweep the middle sand to the edges? The middle (signal band) is almost clean, and the edges (out-of-band, high frequency) are piled up—and you are going to chop off the edges with a low-pass filter anyway.
Structure of a first-order Sigma-Delta modulator:
Linearized model (treat the quantizer as adding quantization noise $e[n]$):
First-order Sigma-Delta output
$$Y(z) = \underbrace{z^{-1}}_{\text{STF}} X(z) + \underbrace{(1 - z^{-1})}_{\text{NTF}} E(z)$$STF = Signal Transfer Function (signal is delayed by one sample); NTF = Noise Transfer Function (noise is high-pass shaped).
Frequency response of the NTF:
$|NTF(e^{j\omega})| = |1 - e^{-j\omega}| = 2|\sin(\omega/2)|$: a zero at $\omega = 0$ (DC, low frequency) and the maximum value 2 at $\omega = \pi$ (Nyquist) → noise is pushed to high frequencies.
$L$-th order noise shaping
Cascading $L$ integrators (an $L$-th order modulator):
$L$-th order NTF
$$NTF_L(z) = (1 - z^{-1})^L$$$L$-th order differentiation → noise at low frequencies is suppressed even more strongly.
SNR of $L$-th order + $R$x oversampling
$$\text{SNR} \approx 6.02B + 1.76 + (2L+1) \cdot 10\log_{10}(R) - 10\log_{10}\!\left(\frac{\pi^{2L}}{2L+1}\right) \;\text{dB}$$For a 1-bit converter ($B=1$), each doubling of OSR gains $(2L+1) \times 3$ dB.
Key comparison:
Plain oversampling: $2\times$ OSR → +3 dB (+0.5 bit)
1st-order shaping: $2\times$ OSR → +9 dB (+1.5 bit)
2nd-order shaping: $2\times$ OSR → +15 dB (+2.5 bit)
4th-order shaping: $2\times$ OSR → +27 dB (+4.5 bit)
4th-order + 256x OSR → OSR gain alone $\approx 27 \times 8 = 216$ dB; after the 1-bit baseline and the $\pi^{2L}$ correction, $\approx 194$ dB theoretical → roughly 120 dB (20 bit) in practice.
Step 3: digital filtering + decimation
The modulator output is a high-rate 1-bit bitstream; it passes through:
- Digital low-pass filter (cutoff $f_b$) → removes the quantization noise that was shaped to high frequencies
- Decimation (factor $R$) → returns to the target sample rate $2f_b$
- Produces a high-resolution (16-24 bit) output
This "digital filter + decimation" is usually implemented as a multi-stage CIC (Cascaded Integrator-Comb) filter followed by half-band filters — using exactly the techniques from sections 5.1-5.2!
Show full derivation: transfer function of a first-order Sigma-Delta modulator
Let the integrator have transfer function $\frac{z^{-1}}{1-z^{-1}}$ (discrete integrator), and write the quantizer output as $y[n] = u[n] + e[n]$, where $u[n]$ is the quantizer input and $e[n]$ is the quantization error.
Loop equations in the Z-domain:
$$U(z) = \frac{z^{-1}}{1-z^{-1}}\bigl[X(z) - Y(z)\bigr]$$ $$Y(z) = U(z) + E(z)$$Substituting the first into the second:
$$Y(z) = \frac{z^{-1}}{1-z^{-1}}[X(z) - Y(z)] + E(z)$$ $$Y(z)\left[1 + \frac{z^{-1}}{1-z^{-1}}\right] = \frac{z^{-1}}{1-z^{-1}}X(z) + E(z)$$ $$Y(z) \cdot \frac{1}{1-z^{-1}} = \frac{z^{-1}}{1-z^{-1}}X(z) + E(z)$$ $$Y(z) = z^{-1}X(z) + (1-z^{-1})E(z)$$Therefore STF = $z^{-1}$ (the signal is delayed by one sample, distortion-free) and NTF = $(1-z^{-1})$ (quantization noise is shaped by a first-order high-pass).
In-band noise power:
$$\sigma_{\text{in-band}}^2 = \frac{2\sigma_e^2}{f_s}\int_0^{f_b} \bigl|1-e^{-j2\pi f/f_s}\bigr|^2\, df \approx \sigma_e^2 \cdot \frac{\pi^2}{3R^3}$$where $R = f_s/(2f_b)$ is the OSR. Compared with the unshaped in-band noise $\sigma_e^2/R$, first-order shaping multiplies it by a further factor $\pi^2/(3R^2) \ll 1$.
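The shaping is easy to see in a behavioral simulation. The sketch below runs a first-order loop with a 1-bit quantizer, then decimates with two boxcar passes (a crude sinc²/CIC-style low-pass); all signal parameters are illustrative:

```python
import numpy as np

def sd1_modulate(x):
    """First-order sigma-delta: integrator + 1-bit quantizer + feedback."""
    u, fb = 0.0, 0.0
    y = np.empty_like(x)
    for n, xn in enumerate(x):
        u += xn - fb                        # integrator accumulates (input - feedback)
        y[n] = 1.0 if u >= 0 else -1.0      # 1-bit quantizer
        fb = y[n]
    return y

R = 64                                       # oversampling ratio
n = np.arange(64 * R * 8)
x = 0.5 * np.sin(2 * np.pi * n / (R * 32))   # slow in-band sine, amplitude 0.5

box = np.ones(R) / R                         # length-R boxcar
def chain(sig):                              # sinc^2 low-pass, then keep every R-th sample
    return np.convolve(np.convolve(sig, box), box)[::R]

ref = chain(x)                               # what an ideal converter would deliver
err_sd = chain(sd1_modulate(x)) - ref        # sigma-delta error after decimation
err_1b = chain(np.sign(x)) - ref             # plain 1-bit quantization, same decimation

snr = lambda e: 10 * np.log10(np.mean(ref**2) / np.mean(e**2))
print(f"plain 1-bit: {snr(err_1b):.1f} dB, sigma-delta: {snr(err_sd):.1f} dB")
```

The feedback loop makes the 1-bit stream's local average track the input, so after low-pass filtering the sigma-delta output is tens of dB cleaner than memoryless 1-bit quantization.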
How to Use: Sigma-Delta ADC design calculation
Example: audio ADC (20 kHz bandwidth, 16-bit target)
- Target SNR: 16 bit → $6.02 \times 16 + 1.76 = 98.1$ dB
- Architecture choice: 3rd-order modulator + 128x oversampling
- 1-bit SQNR = 7.78 dB
- OSR gain = $(2 \times 3 + 1) \times 10\log_{10}(128) = 7 \times 21.1 = 147.5$ dB
- Noise-shaping correction = $-10\log_{10}(\pi^6/7) \approx -21.4$ dB
- Theoretical SNR $\approx 7.78 + 147.5 - 21.4 = 133.9$ dB ($\approx 22$ bit)
- With non-ideal effects (op-amp limits, clock jitter, DAC nonlinearity) costing 20-30 dB → actual roughly 104-114 dB (17-18.5 bit) ✓
- Sample rate: $f_s = 2 \times 20{,}000 \times 128 = 5.12$ MHz. A 1-bit comparator handles this speed easily.
- Digital filter: 4th-order CIC ($R = 16$) → 3 stages of half-band FIR ($\downarrow 2$) → FIR compensator → final $128\times$ decimation to 40 kHz.
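The arithmetic above can be wrapped in a small calculator for this section's SNR formula (note that $10\log_{10}(\pi^6/7) \approx 21.4$ dB for a 3rd-order loop):

```python
import numpy as np

def sd_snr_db(B, L, R):
    """Theoretical SNR of an L-th order, B-bit sigma-delta ADC at oversampling ratio R."""
    return (6.02 * B + 1.76
            + (2 * L + 1) * 10 * np.log10(R)        # OSR gain, (2L+1)*3 dB per octave
            - 10 * np.log10(np.pi ** (2 * L) / (2 * L + 1)))  # shaping correction

# 3rd-order, 1-bit, 128x OSR: the audio-ADC example above
print(round(sd_snr_db(1, 3, 128), 1))   # 133.9 dB theoretical
```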
Concrete numbers at a glance
| Configuration | Theoretical SNR | Equivalent bits | Notes |
|---|---|---|---|
| 1-bit, no oversampling | 7.78 dB | 1.0 bit | Bare comparator |
| 1-bit, 256x OSR, no shaping | $7.78 + 24.1 = 31.9$ dB | ~5 bit | Best you can do with oversampling alone |
| 1-bit, 256x OSR, 1st-order | ~75 dB | ~12 bit | Noise shaping starts to pay off |
| 1-bit, 256x OSR, 2nd-order | ~115 dB | ~19 bit | Very high in theory |
| 1-bit, 256x OSR, 4th-order | ~194 dB | ~32 bit (theory) | Limited by non-idealities in practice, ~120 dB (20 bit) |
Interactive: effects of oversampling and noise shaping
Adjust the oversampling ratio and the noise-shaping order and observe how the quantization-noise power spectral density changes, along with the equivalent SNR and bits.
Applications
- Audio ADC/DAC: from mobile phones to recording studios, almost every audio converter uses Sigma-Delta. The AKM AK5397 (32-bit/768 kHz) and ESS ES9038PRO are high-end examples.
- Precision measurement: 24-bit Sigma-Delta ADCs (e.g. ADS1256) are used for weighing, temperature measurement and strain gauges. Very high resolution + low speed = the sweet spot for Sigma-Delta.
- Sensor interfaces: MEMS accelerometers and gyroscopes use Sigma-Delta extensively to convert tiny capacitance changes into digital values.
- Digital radio receivers: a wideband Sigma-Delta ADC + digital downconversion can digitize an RF band directly (bandpass $\Sigma\Delta$).
- Class-D audio amplifiers: essentially Sigma-Delta DACs—driving a speaker with PWM switches, with noise shaping ensuring very low distortion in the audible range.
Pitfalls and Limitations
- Stability of high-order modulators: single-loop modulators of 3rd order or higher can become unstable (the quantizer's nonlinearity breaks the linear-analysis assumption). Remedies: MASH (Multi-stAge noise SHaping) structures or feedforward architectures.
- Latency: the digital decimation filters (especially multi-stage CIC + FIR) introduce significant delay, which may be unacceptable for real-time control applications.
- Bandwidth limit: high OSR demands a very high sample rate. 20 kHz audio × 256 = 5.12 MHz is fine, but wideband communications (tens of MHz) need ultra-fast comparators.
- Tones: low-order Sigma-Delta modulators can produce periodic patterns (idle tones) for DC or low-frequency inputs, which sound like a hum. Dither or higher-order modulators are needed to eliminate them.
- DAC nonlinearity (multi-bit Sigma-Delta): nonlinearity in the feedback DAC injects directly into the signal path. Multi-bit internal quantizers improve performance but demand very linear DACs. Remedy: DEM (Dynamic Element Matching).
Quick Check
Q1: A 2nd-order Sigma-Delta modulator uses 64x oversampling. How many dB does SNR improve per doubling of OSR? How many equivalent bits?
Show answer
For $L = 2$, SNR improvement = $(2L+1) \times 3 = 5 \times 3 = 15$ dB per octave of OSR.
Equivalent bits gained = $15/6.02 \approx 2.5$ bit per $2\times$ OSR.
So at 64x OSR ($= 2^6$): an increase of $6 \times 15 = 90$ dB ($\approx 15$ bit) above the baseline SQNR, and combined with the 1-bit baseline and shaping correction, the theoretical figure reaches about 85 dB ($\approx 14$ bit).
Q2: Why is the "digital filter + decimation" stage of a Sigma-Delta ADC exactly the place where the multirate techniques we learned earlier (polyphase, CIC) are used?
Show answer
The modulator outputs a very high-rate (e.g. 5.12 MHz) 1-bit bitstream. Turning it into a low-rate (e.g. 40 kHz) multi-bit output requires 128x decimation.
Running a single FIR at 5.12 MHz and then decimating by 128 wastes 99.2% of the computation. In practice we therefore use:
- CIC filter: no multipliers, only adders and delays—naturally suited to the first high-rate decimation stage.
- Polyphase half-band filters: for the subsequent $\downarrow 2$ stages, with a polyphase structure avoiding wasted computation.
This is the ideal real-world application of the techniques from sections 5.1-5.2.
References: [1] Inose, H., Yasuda, Y. & Murakami, J., A Telemetering System by Code Modulation: Delta-Sigma Modulation, IRE Trans., 1962. [2] Candy, J.C., A Use of Limit Cycle Oscillations to Obtain Robust Analog-to-Digital Converters, IEEE Trans., 1974. [3] Norsworthy, S.R., Schreier, R. & Temes, G.C., Delta-Sigma Data Converters: Theory, Design, and Simulation, IEEE Press, 1997. [4] Schreier, R. & Temes, G.C., Understanding Delta-Sigma Data Converters, 2nd ed., IEEE/Wiley, 2017. [5] Aziz, P.M., Sorensen, H.V. & van der Spiegel, J., An Overview of Sigma-Delta Converters, IEEE SP Magazine, 1996.
8A Random Processes & Wide-Sense Stationarity
Real signals always have randomness — the bridge between statistics and spectra
Learning Objectives
- Understand the definition of a random process: a collection of random variables indexed by time
- Distinguish between Strict-Sense Stationary (SSS) and Wide-Sense Stationary (WSS) definitions and their practicality
- Master the properties and physical meaning of the autocorrelation function $R_{xx}[m]$
- Derive and apply the Wiener-Khinchin theorem: $S_{xx}(e^{j\omega}) = \text{DTFT}\{R_{xx}[m]\}$
- Understand the spectral characteristics of white noise and colored noise
One-Sentence Summary
Random processes combine "randomness" with "time," and the Wiener-Khinchin theorem tells us: the Fourier transform of the autocorrelation function is the Power Spectral Density (PSD)—this bridge lets us analyse random signals with frequency-domain tools.
Why does this matter? Because there is no such thing as a "clean" signal in the real world. Every sensor reading contains noise, interference, and channel fading. The DFT, STFT and wavelets we have learned so far all assume the signal is deterministic—but for random signals you need statistical tools to describe the "average behaviour." This is the necessary foundation before Wiener filtering, Kalman filtering and adaptive filtering.
Previously: 5.7 Synchrosqueezing showed us fine time-frequency structure, but only for deterministic signals. Now we enter random signal analysis—we no longer ask "what does this signal look like?" but instead "what are the statistical properties of signals like these?" That requires a new mathematical framework.
The Problem: deterministic analysis is not enough
Suppose you are measuring vibration from an engine. Every time you start the engine the waveform is different (due to random factors such as ambient noise and initial conditions), yet the "statistical properties" (average power, spectral shape) are stable:
- Communications: received signal = original signal + channel noise. You cannot predict each noise sample, but you can describe its statistics (Gaussian white noise, power $\sigma^2$).
- Radar: the target echo is buried in clutter and thermal noise. Detection theory requires the noise power spectral density.
- Finance: stock prices are unpredictable (random walk) but volatility has statistical regularities.
- Biomedical: the alpha-band power of EEG is a statistic averaged over many measurements; a single measurement is unreliable.
Core question: the FFT spectrum of a random signal is "different every time" and does not converge to a fixed function. We need a way to define the frequency-domain properties of random signals through statistical averages.
Origin
Norbert Wiener (1930) proposed in his pioneering work Generalized Harmonic Analysis that for signals with finite power but infinite energy (such as random signals), the classical Fourier transform does not exist, but the Fourier transform of the autocorrelation function (the power spectral density) does make sense.
Alexander Khinchin (1934) independently proved the same result from a probability-theoretic perspective: for a stationary random process, the autocorrelation function and the power spectral density form a Fourier transform pair. This is the Wiener-Khinchin theorem—the central bridge between time-domain statistics (correlation functions) and frequency-domain representation (PSD).
Wiener went further and used PSD to derive the optimal linear filter (Wiener filter, see 8B), ushering in the era of statistical signal processing.
Core Concepts
Intuition: imagine a factory with 100 identical production lines, each with a vibration sensor recording data simultaneously. The waveform on each line is different (because of randomness), but their "average waveform" and "average power" are stable. A random process packages these 100 "possible waveforms" (each called a realization or sample function) into a single mathematical object.
Random process definition
A random process $\{x[n]\}$ is a collection of random variables indexed by time $n$. For each $n$, $x[n]$ is a random variable; for each "experiment" (realization), $x[n]$ is a deterministic time series (sample function).
Ensemble average vs. time average
| Averaging method | Definition | When it applies |
|---|---|---|
| Ensemble average | $E[x[n]] = \int x \cdot f_X(x;n)\, dx$ (fix $n$, average over all realizations) | Theoretical analysis (requires multiple experiments) |
| Time average | $\langle x[n] \rangle = \lim_{N\to\infty}\frac{1}{2N+1}\sum_{n=-N}^{N} x[n]$ (fix one realization, average over all time) | Practice (usually only a single measurement) |
Ergodicity: if the random process is ergodic, ensemble average = time average. This is an extremely important assumption in practice—because we usually only have one recording (a single realization) and must use it to estimate statistics.
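The ergodicity claim can be checked numerically. The sketch below is illustrative (the AR(1) model, seed, and sizes are my own choices): it compares the ensemble average at a fixed time index against the time average of a single realization; for this ergodic process both estimates approach the true mean of 0.

```python
import numpy as np

rng = np.random.default_rng(0)

def ar1_realizations(a=0.9, n_samples=2000, n_realizations=200):
    """Generate independent realizations of x[n] = a*x[n-1] + w[n]."""
    w = rng.standard_normal((n_realizations, n_samples))
    x = np.zeros_like(w)
    for n in range(1, n_samples):
        x[:, n] = a * x[:, n - 1] + w[:, n]
    return x

x = ar1_realizations()

# Ensemble average: fix a time index, average across all realizations.
ensemble_mean = x[:, -1].mean()

# Time average: fix one realization, average over time.
time_mean = x[0].mean()

# For this (ergodic) process both estimates approach the true mean 0.
print(ensemble_mean, time_mean)
```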
Stationarity
Strict-Sense Stationary (SSS): all finite-dimensional joint PDFs are invariant under time shifts:
$$f_{x[n_1],\dots,x[n_k]}(x_1,\dots,x_k) = f_{x[n_1+\tau],\dots,x[n_k+\tau]}(x_1,\dots,x_k) \quad \forall k,\ \forall \tau$$
This condition is too strong—verifying "all" joint PDFs is practically impossible.
Wide-Sense Stationary (WSS): only requires the first- and second-order statistics to be time-invariant:
Two WSS conditions
$$\text{(1)}\quad E[x[n]] = \mu_x \quad \text{(mean is constant, does not depend on $n$)}$$ $$\text{(2)}\quad R_{xx}[n, n-m] = R_{xx}[m] \quad \text{(autocorrelation depends only on lag $m$, not on absolute time $n$)}$$
WSS is the standard engineering assumption. Vibrations from most steady-state machines, stationary communication channels, and long-duration environmental noise can all reasonably be treated as WSS.
Properties of the autocorrelation function
- $R_{xx}[0] = E[|x[n]|^2]$ = average power
- $R_{xx}[0] \geq |R_{xx}[m]|$ for all $m$ (maximum at the origin)
- $R_{xx}[-m] = R_{xx}^*[m]$ (conjugate symmetry)
- $R_{xx}[m]$ is a positive semi-definite function (guaranteeing a non-negative PSD)
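These properties are easy to verify numerically. A minimal sketch (white noise is chosen for simplicity; the biased estimator divides by $N$ rather than $N - |m|$, which preserves positive semi-definiteness):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 4096
x = rng.standard_normal(N)  # one realization of unit-variance white noise

def autocorr_biased(x, max_lag):
    """Biased autocorrelation estimate: divides by N, not N - |m|,
    which keeps the estimated sequence positive semi-definite."""
    N = len(x)
    return np.array([np.dot(x[:N - m], x[m:]) / N for m in range(max_lag + 1)])

R = autocorr_biased(x, 50)

print(R[0])                  # ~ average power (sigma^2 = 1)
print(np.abs(R[1:]).max())   # lags != 0 are near zero for white noise
```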
Cross-correlation function:
$$R_{xy}[m] = E\big[x[n]\, y^*[n-m]\big]$$
It measures the linear statistical dependence between two processes at lag $m$.
The Wiener-Khinchin theorem
This is the most important result of the chapter—the bridge between time-domain statistics and frequency-domain representation:
Wiener-Khinchin Theorem
$$S_{xx}(e^{j\omega}) = \text{DTFT}\{R_{xx}[m]\} = \sum_{m=-\infty}^{\infty} R_{xx}[m]\, e^{-j\omega m}$$
$S_{xx}(e^{j\omega})$ = Power Spectral Density (PSD)
Key properties of the PSD:
- $S_{xx}(e^{j\omega}) \geq 0$ (always non-negative and real)—it represents the "frequency distribution of power."
- $\frac{1}{2\pi}\int_{-\pi}^{\pi} S_{xx}(e^{j\omega})\, d\omega = R_{xx}[0]$ = average power
- $S_{xx}(e^{j\omega}) = S_{xx}(e^{-j\omega})$ (for real signals the PSD is an even function)
Show full derivation: sketch of the Wiener-Khinchin theorem
For finite-power random signals the DTFT does not converge directly (infinite energy). Use the truncated version instead:
$$X_N(e^{j\omega}) = \sum_{n=0}^{N-1} x[n]\, e^{-j\omega n}$$
Define the periodogram:
$$I_N(\omega) = \frac{1}{N}|X_N(e^{j\omega})|^2$$
Take the expectation:
$$E[I_N(\omega)] = \frac{1}{N}\sum_{n=0}^{N-1}\sum_{k=0}^{N-1} E[x[n]x^*[k]]\, e^{-j\omega(n-k)}$$
Let $m = n - k$ and use the WSS condition $E[x[n]x^*[k]] = R_{xx}[n-k]$:
$$= \sum_{m=-(N-1)}^{N-1}\left(1 - \frac{|m|}{N}\right) R_{xx}[m]\, e^{-j\omega m}$$
As $N \to \infty$:
$$\lim_{N\to\infty} E[I_N(\omega)] = \sum_{m=-\infty}^{\infty} R_{xx}[m]\, e^{-j\omega m} = S_{xx}(e^{j\omega}) \quad \blacksquare$$
Note: this shows that the expectation of the periodogram converges to the PSD, but the variance of a single periodogram does not decrease with $N$ (which is why PSD estimation requires Welch's method or multitapers—recall section 3.2).
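The note above can be seen directly in simulation: a single periodogram of white noise stays noisy no matter how long the record, while averaging periodograms over segments (Bartlett's method, the unwindowed ancestor of Welch's) concentrates around the true flat PSD. An illustrative sketch (segment length and count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2 = 1.0
seg_len, n_segs = 256, 64
x = rng.standard_normal(seg_len * n_segs) * np.sqrt(sigma2)

# One periodogram per segment: I_N(w) = |X_N(w)|^2 / N
segs = x.reshape(n_segs, seg_len)
periodograms = np.abs(np.fft.rfft(segs, axis=1))**2 / seg_len

single = periodograms[0]              # high variance; does not improve with N
averaged = periodograms.mean(axis=0)  # Bartlett average -> approaches S_xx = sigma^2

print(single.std(), averaged.std())   # the averaged estimate is much flatter
```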
White noise
White noise is defined by $R_{ww}[m] = \sigma^2\, \delta[m]$: the autocorrelation is nonzero only at $m = 0$, so samples at different times are completely uncorrelated. Its DTFT is the constant $S_{ww}(e^{j\omega}) = \sigma^2$, meaning the power is the same at all frequencies (the origin of the term "white", by analogy with white light containing all colours).
Coloured noise: white noise after filtering
If white noise $w[n]$ passes through an LTI system $H(z)$ producing output $y[n]$:
$$S_{yy}(e^{j\omega}) = |H(e^{j\omega})|^2\, S_{ww}(e^{j\omega}) = \sigma^2\, |H(e^{j\omega})|^2$$
The "flat" white-noise spectrum is shaped by $|H|^2$, producing "coloured" noise with a specific spectral shape. For instance, the AR(1) filter $H(z) = 1/(1 - az^{-1})$ with $0 < a < 1$ produces low-frequency-dominated ("red-type") noise.
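A quick numerical illustration (the record length and filter coefficient are my own choices): filter white noise through the AR(1) system with `scipy.signal.lfilter` and compare Welch PSD estimates of the input and output.

```python
import numpy as np
from scipy.signal import lfilter, welch

rng = np.random.default_rng(3)
w = rng.standard_normal(65536)            # white noise, sigma^2 = 1

# AR(1) shaping filter H(z) = 1 / (1 - 0.9 z^-1)
y = lfilter([1.0], [1.0, -0.9], w)

f, S_w = welch(w, nperseg=1024)           # input PSD: flat
f, S_y = welch(y, nperseg=1024)           # output PSD: shaped by |H|^2

# Output power is concentrated at low frequency (large low/high ratio)
print(S_y[1] / S_y[-1])
```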
How to Use: numerical example
Step 1: define the random process
Consider the AR(1) process $x[n] = 0.9\, x[n-1] + w[n]$, where $w[n]$ is white noise with $\sigma^2 = 1$.
Step 2: compute the theoretical autocorrelation
The AR(1) autocorrelation has a closed form:
$$R_{xx}[m] = \frac{\sigma^2}{1 - a^2}\, a^{|m|} = \frac{0.9^{|m|}}{1 - 0.81} \approx 5.26 \times 0.9^{|m|}$$
Step 3: compute the theoretical PSD
$H(z) = 1/(1 - 0.9z^{-1})$, so:
$$S_{xx}(e^{j\omega}) = \sigma^2\, |H(e^{j\omega})|^2 = \frac{1}{|1 - 0.9e^{-j\omega}|^2} = \frac{1}{1.81 - 1.8\cos\omega}$$
At low frequency ($\omega = 0$, i.e. $z = 1$): $S_{xx} = 1/(1.81 - 1.8) = 100$ (= 20 dB); at high frequency ($\omega = \pi$, i.e. $z = -1$): $S_{xx} = 1/(1.81 + 1.8) \approx 0.277$ (= -5.6 dB). This is a low-pass-shaped PSD.
Step 4: estimate from real data
Generate $N = 1024$ points of AR(1) data, estimate the PSD with Welch's method and compare to the theoretical value. The more averages, the more accurate the estimate.
Practical key point: when estimating PSDs you always trade off frequency resolution against estimator variance. Longer segments give better frequency resolution but fewer segments to average, making the estimate less stable (recall Welch's method in 3.2).
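Steps 1-4 can be sketched in a few lines. This is an illustrative run, not reference code: a longer record than the text's $N = 1024$ is used so the Welch estimate is visibly close to theory, and the comparison accounts for scipy's one-sided density convention (one-sided $P_{xx} = 2 \times$ two-sided $S$ at interior bins).

```python
import numpy as np
from scipy.signal import lfilter, welch

rng = np.random.default_rng(4)

# Step 1: AR(1) process x[n] = 0.9 x[n-1] + w[n], white w with sigma^2 = 1
a, N = 0.9, 16384
x = lfilter([1.0], [1.0, -a], rng.standard_normal(N))

# Step 2: theoretical autocorrelation R[m] = a^|m| / (1 - a^2)
R0 = 1.0 / (1 - a**2)                    # average power ~ 5.26
power_hat = np.mean(x[100:]**2)          # skip the start-up transient

# Step 3: theoretical PSD S(e^jw) = 1 / (1.81 - 1.8 cos w)
f, Pxx = welch(x, nperseg=1024)          # one-sided density, fs = 1
w = 2 * np.pi * f
S_theory = 1.0 / (1 + a**2 - 2 * a * np.cos(w))

# Step 4: compare at an interior frequency bin
k = 256                                  # f = 0.25, i.e. w = pi/2
ratio = (Pxx[k] / 2) / S_theory[k]
print(power_hat, ratio)                  # power near 5.26, ratio near 1
```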
Interactive: estimating statistics of random processes
Observe how, as the number of realizations increases, the estimated autocorrelation and PSD gradually approach the theoretical values.
Applications
- Communication system design: channel noise is modelled as a WSS process, and its PSD determines receiver sensitivity and the choice of modulation. The AWGN (Additive White Gaussian Noise) channel is the most basic model.
- Radar signal processing: the PSD shape of clutter determines the design of the MTI (Moving Target Indication) filter. Gaussian-shaped clutter PSD requires multi-pulse cancellers.
- Vibration monitoring: vibration of a machine running in steady state can be treated as a WSS process. Peak frequencies in the PSD correspond to rotation speed and fault characteristic frequencies.
- Financial time series: the autocorrelation structure of stock-return series (e.g. GARCH models) drives volatility forecasting and risk management strategies.
- Speech coding: Linear Predictive Coding (LPC) models speech as an AR process; the Levinson-Durbin recursive solution of the autocorrelation matrix is the core algorithm.
Pitfalls and Limitations
- The WSS assumption does not always hold: non-stationary processes (e.g. engine acceleration, speech transitions) violate WSS. You then need short-time analysis (assuming approximate WSS over short windows) or a time-varying model.
- Ergodicity cannot be assumed: not every WSS process is ergodic. For example $x[n] = A\cos(\omega_0 n + \theta)$, with $A$ a random amplitude and $\theta$ uniform on $[0, 2\pi)$, is WSS but not ergodic: the time-averaged power of a single realization is $A^2/2$ (a random value that depends on which realization you observed), while the ensemble average power is $E[A^2]/2$; no single recording can recover the ensemble statistic.
- Estimator bias from finite data: autocorrelation estimates at large lags $|m|$ are unreliable (only $N - |m|$ products are available for averaging). Rule of thumb: only trust estimates with $|m| < N/10$.
- The periodogram is a poor PSD estimator: the variance of a single periodogram does not decrease with $N$ (inconsistent estimator). You must use Welch's method, multitapers, or parametric methods (see sections 3.1–3.3).
- Cross-correlation is not causation: $R_{xy}[m] \neq 0$ only indicates linear statistical correlation, not that $x$ causes $y$.
Quick Check
Q1: White noise has autocorrelation $R_{ww}[m] = \sigma^2 \delta[m]$. What is its PSD and why is it called "white" noise?
Show answer
$S_{ww}(e^{j\omega}) = \sigma^2$, a constant—the power is equal at every frequency. Just as white light contains every visible frequency, this is why it is called "white" noise. Mathematically, the DTFT of $\delta[m]$ is the constant 1, multiplied by $\sigma^2$.
Q2: A WSS process has a PSD with a large peak at $\omega = 0$ and is near zero around $\omega = \pi$. What does this tell you? How would you describe it in filter language?
Show answer
This is a low-frequency-dominated process (e.g. 1/f noise or an AR(1) process with positive correlation). It is equivalent to the output of a low-pass filter $H(z)$ driven by white noise: $S_{xx} = |H|^2 \sigma^2$. Adjacent samples are positively correlated ($R_{xx}[1] > 0$) and the signal varies smoothly.
References: [1] Wiener, N., Generalized Harmonic Analysis, Acta Math., 55:117-258, 1930. [2] Khinchin, A., Korrelationstheorie der stationaren stochastischen Prozesse, Math. Ann., 109:604-615, 1934. [3] Oppenheim & Schafer, Discrete-Time Signal Processing, 3rd ed., Ch.11. [4] Haykin, S., Adaptive Filter Theory, 5th ed., Ch.2. [5] Papoulis, A. & Pillai, S.U., Probability, Random Variables and Stochastic Processes, 4th ed., Ch.9-12.
8B Wiener Filter
Optimal linear estimation when statistical properties are known — minimum mean square error filtering
Learning Objectives
- Understand the Wiener filter problem formulation: optimally estimating a target signal from noisy observations
- Derive the Wiener-Hopf equation: $\mathbf{R}_{xx}\, \mathbf{h}_{\text{opt}} = \mathbf{r}_{dx}$
- Understand the frequency-domain Wiener filter: $H_{\text{opt}}(e^{j\omega}) = S_{dx}(e^{j\omega}) / S_{xx}(e^{j\omega})$
- Analyze the special case of signal plus uncorrelated noise and its intuitive meaning
- Understand the relationship between Wiener filtering and M4E Adaptive Filtering (LMS)
One-Sentence Summary
The Wiener filter is the optimal linear filter when the signal and noise statistics are known—it automatically decides whether to "pass or suppress" at each frequency based on the local SNR, letting the signal through in high-SNR bands and attenuating the noise in low-SNR bands.
Why does this matter? The Wiener filter is the starting point of all optimal linear filtering theory. The LMS adaptive filter aims to approximate the Wiener solution; the Kalman filter is its non-stationary generalisation; the classical approaches to speech enhancement, image denoising, and communication equalisation all come from this framework. Understanding Wiener filtering is understanding the cornerstone of statistical signal processing.
Previously: 8A built up the WSS random-process framework—autocorrelation $R_{xx}[m]$, power spectral density $S_{xx}(e^{j\omega})$, and the Wiener-Khinchin theorem. Now we use these tools to answer a central question: how do we "best" recover the original signal from noisy observations?
The Problem: simple filters are not smart enough
You receive a noisy signal $x[n] = d[n] + v[n]$ ($d[n]$ is the desired signal, $v[n]$ is noise). Intuitively you would use a low-pass filter to denoise, but:
- How do you choose the cutoff frequency? If the signal and noise bands overlap (which almost always happens), any fixed cutoff will either damage the signal or let noise through.
- Speech denoising: the speech band (100 Hz – 4 kHz) overlaps white noise completely. A low-pass filter will also cut the high-frequency consonants (/s/, /f/).
- The idea: the ideal filter should be frequency-dependent—pass through in bands where the signal is strong, fully suppress in bands where noise dominates, and strike the best compromise in mixed bands.
Core question: given the signal and noise statistics (PSDs), can we systematically derive the "best" filter so that the power of the estimation error is minimised?
Origin
Norbert Wiener (1942/1949): during WWII Wiener developed optimal linear prediction and filtering theory in order to predict the flight trajectories of enemy aircraft for anti-aircraft gun aiming. His classified report was published after the war under the title Extrapolation, Interpolation, and Smoothing of Stationary Time Series (nicknamed "The Yellow Peril" because of its yellow cover—and because it was so hard to read).
Andrey Kolmogorov (1941): in the Soviet Union Kolmogorov independently derived essentially the same result for the prediction of stationary time series. For this reason it is sometimes called Kolmogorov-Wiener filtering.
The Wiener-Hopf equation is named after a class of integral equations that Wiener and Eberhard Hopf studied together in 1931. The Wiener filtering problem happens to reduce to this class of equations.
Core Concepts
Intuition: imagine listening to a friend in a noisy restaurant. What is your brain doing? In the bands where your friend's voice is clear (e.g. certain vowel formants) you listen almost completely; in the bands dominated by background noise (e.g. low-frequency rumble) you automatically ignore. The Wiener filter does exactly this—but in a mathematically optimal way.
Problem setup
- Observation: $x[n]$ (noisy signal)
- Target: $d[n]$ (signal to be estimated)
- Filter output: $\hat{d}[n] = \sum_{k} h[k]\, x[n-k] = h * x$
- Error: $e[n] = d[n] - \hat{d}[n]$
- Goal: minimise the mean-square error $J = E[|e[n]|^2]$
Derivation: the Wiener-Hopf equation
Take the gradient of $J$ and set it to zero:
Optimality condition (orthogonality principle)
$$\frac{\partial E[|e[n]|^2]}{\partial h^*[k]} = 0 \quad \forall k$$ $$\Longrightarrow \quad E[e[n]\, x^*[n-k]] = 0 \quad \forall k$$
The optimal error is orthogonal to all observations (Orthogonality Principle).
Show full derivation: from orthogonality principle to Wiener-Hopf equation
Expand $e[n] = d[n] - \hat{d}[n] = d[n] - \sum_l h[l] x[n-l]$ and substitute into the orthogonality condition:
$$E\Big[\Big(d[n] - \sum_l h[l] x[n-l]\Big) x^*[n-k]\Big] = 0$$ $$E[d[n] x^*[n-k]] = \sum_l h[l]\, E[x[n-l] x^*[n-k]]$$
Using the WSS definitions $E[x[n-l] x^*[n-k]] = R_{xx}[k-l]$ and $E[d[n] x^*[n-k]] = R_{dx}[k]$:
$$R_{dx}[k] = \sum_l h_{\text{opt}}[l]\, R_{xx}[k-l] = (h_{\text{opt}} * R_{xx})[k]$$
This is the convolutional form of the Wiener-Hopf equation.
Matrix form (FIR filter, order $M$):
$$\underbrace{\begin{bmatrix} R_{xx}[0] & R_{xx}[1] & \cdots & R_{xx}[M-1] \\ R_{xx}[1] & R_{xx}[0] & \cdots & R_{xx}[M-2] \\ \vdots & & \ddots & \vdots \\ R_{xx}[M-1] & & \cdots & R_{xx}[0] \end{bmatrix}}_{\mathbf{R}_{xx}\ (\text{Toeplitz})} \underbrace{\begin{bmatrix} h[0] \\ h[1] \\ \vdots \\ h[M-1] \end{bmatrix}}_{\mathbf{h}_{\text{opt}}} = \underbrace{\begin{bmatrix} R_{dx}[0] \\ R_{dx}[1] \\ \vdots \\ R_{dx}[M-1] \end{bmatrix}}_{\mathbf{r}_{dx}}$$
$\mathbf{R}_{xx}$ is a Toeplitz matrix, so the system can be solved with the Levinson-Durbin algorithm in $O(M^2)$ operations (instead of $O(M^3)$ for general matrix inversion).
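In code, the Toeplitz structure maps directly onto `scipy.linalg.solve_toeplitz`, which uses a Levinson-type recursion. The sketch below sets up a hypothetical denoising problem (AR(1) signal plus white noise; the identities $R_{xx}[m] = R_{dd}[m] + \sigma_v^2\,\delta[m]$ and $R_{dx}[m] = R_{dd}[m]$ follow from the uncorrelated-noise special case discussed in this section) and checks the fast solve against a generic solver:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

# Hypothetical setup: x = d + v, d an AR(1) signal, v white noise.
# Uncorrelated noise: R_xx[m] = R_dd[m] + sigma_v^2 * delta[m], R_dx[m] = R_dd[m].
a, sigma_v2, M = 0.9, 1.0, 8
m = np.arange(M)
R_dd = (a**m) / (1 - a**2)               # AR(1) autocorrelation, sigma_w^2 = 1
r_col = R_dd.copy()
r_col[0] += sigma_v2                      # R_xx[0] = R_dd[0] + sigma_v^2
r_dx = R_dd                               # cross-correlation vector

# Levinson-type O(M^2) Toeplitz solve instead of generic O(M^3) inversion
h_opt = solve_toeplitz(r_col, r_dx)

# Sanity check against the generic dense solver
R_xx = np.array([[r_col[abs(i - j)] for j in range(M)] for i in range(M)])
h_ref = np.linalg.solve(R_xx, r_dx)
print(np.max(np.abs(h_opt - h_ref)))     # ~ 0
```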
Frequency-domain form: taking the DTFT of the convolution equation:
$$S_{dx}(e^{j\omega}) = H_{\text{opt}}(e^{j\omega}) \cdot S_{xx}(e^{j\omega})$$ $$\boxed{H_{\text{opt}}(e^{j\omega}) = \frac{S_{dx}(e^{j\omega})}{S_{xx}(e^{j\omega})}} \quad \blacksquare$$
Wiener-Hopf Equation
$$\text{Matrix form:}\quad \mathbf{R}_{xx}\, \mathbf{h}_{\text{opt}} = \mathbf{r}_{dx}$$ $$\text{Frequency-domain form:}\quad H_{\text{opt}}(e^{j\omega}) = \frac{S_{dx}(e^{j\omega})}{S_{xx}(e^{j\omega})}$$
Special case: signal + uncorrelated noise
If $x[n] = d[n] + v[n]$ where $d$ and $v$ are uncorrelated ($R_{dv}[m] = 0$), then:
- $S_{xx} = S_{dd} + S_{vv}$ (powers add)
- $S_{xd} = S_{dd}$ (cross-PSD equals the signal PSD)
Wiener denoising formula
$$H_{\text{opt}}(e^{j\omega}) = \frac{S_{dd}(e^{j\omega})}{S_{dd}(e^{j\omega}) + S_{vv}(e^{j\omega})}$$
Intuitive interpretation:
- High-SNR bands ($S_{dd} \gg S_{vv}$): $H \approx S_{dd}/S_{dd} = 1$ → pass through
- Low-SNR bands ($S_{vv} \gg S_{dd}$): $H \approx S_{dd}/S_{vv} \approx 0$ → strongly suppressed
- In between: $H$ transitions smoothly with local SNR → frequency-dependent optimal trade-off
Essence: the Wiener filter is a frequency-dependent SNR gate—it independently asks at each frequency "is there more signal or more noise here?" and then applies the optimal gain. This is why it is far better than a low-pass filter with a fixed cutoff.
Minimum Mean-Square Error (MMSE)
$$J_{\min} = \frac{1}{2\pi}\int_{-\pi}^{\pi} \frac{S_{dd}(e^{j\omega})\, S_{vv}(e^{j\omega})}{S_{dd}(e^{j\omega}) + S_{vv}(e^{j\omega})}\, d\omega$$
In bands with 0 dB SNR ($S_{dd} = S_{vv}$) the residual error density is $S_{dd}/2$ (the best you can do is cut it in half); in bands with very high SNR the residual error approaches zero.
Relationship to LMS adaptive filtering
| Property | Wiener filter | LMS adaptive filter |
|---|---|---|
| Required information | $R_{xx}$ and $r_{xd}$ (offline statistics) | Only the input $x[n]$ and reference $d[n]$ |
| Solution method | Solve the Wiener-Hopf equation (one-shot) | Stochastic approximation via gradient descent (per-sample update) |
| Convergence | Instantaneous (non-recursive) | Takes time to converge, controlled by step size $\mu$ |
| Tracking ability | None (assumes WSS) | Yes (can track slowly time-varying environments) |
| Relationship | The optimal target that LMS approximates | After convergence, oscillates around the Wiener solution |
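The table's last row can be demonstrated directly: run LMS on a toy system-identification problem where the Wiener solution is known to be the true filter. This is an illustrative sketch (the step size, filter taps, and noise level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)

# Identification problem: d[n] = (h_true * x)[n] + small noise.
# The Wiener solution here is h_true itself; LMS should converge to it.
h_true = np.array([0.5, -0.3, 0.2])
M = len(h_true)
N = 20000
x = rng.standard_normal(N)
d = np.convolve(x, h_true)[:N] + 0.01 * rng.standard_normal(N)

h = np.zeros(M)
mu = 0.01                                 # step size
for n in range(M, N):
    u = x[n - M + 1:n + 1][::-1]          # input vector [x[n], ..., x[n-M+1]]
    e = d[n] - h @ u                      # a-priori error
    h = h + mu * e * u                    # LMS coefficient update

print(h, h_true)                          # h hovers near the Wiener solution
```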
How to Use: numerical example for speech denoising
Step 1: problem setup
Clean speech $d[n]$ (simulated by a 100 Hz sinusoid) + white noise $v[n]$ ($\sigma^2 = 0.5$), input SNR = 0 dB.
Step 2: estimate the PSDs
Use Welch's method to estimate $S_{dd}$ (from the clean signal) and $S_{vv}$ (from a noise-only segment). In practice the clean signal is not available, so VAD (Voice Activity Detection) is used to estimate $S_{vv}$ during silence.
Step 3: compute the Wiener filter
$$H_{\text{opt}}[k] = \frac{\hat{S}_{dd}[k]}{\hat{S}_{dd}[k] + \hat{S}_{vv}[k]}$$
where $k$ is the frequency-bin index. Near 100 Hz (strong signal) $H \approx 1$; far from 100 Hz (noise-dominated) $H \approx 0$.
Step 4: apply in the frequency domain
$\hat{D}[k] = H_{\text{opt}}[k] \cdot X[k]$, then IFFT back to the time domain.
Step 5: evaluation
Compare SNR before and after filtering: SNR improves from 0 dB to 10–15 dB (depending on how much the signal and noise spectra overlap).
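The five steps can be sketched end to end. This is an illustrative implementation, not the platform's reference code: the 8 kHz rate, Welch segment length, and the interpolation of the Welch-grid gain onto FFT bins are my own choices, and $S_{dd}$ is estimated from the clean signal (as in the text) rather than via VAD.

```python
import numpy as np
from scipy.signal import welch

rng = np.random.default_rng(6)
fs, N = 8000, 8192
t = np.arange(N) / fs

# Step 1: "speech" stand-in (100 Hz sinusoid, power 0.5) + white noise (0.5) -> 0 dB
d = np.sin(2 * np.pi * 100 * t)
v = np.sqrt(0.5) * rng.standard_normal(N)
x = d + v

# Step 2: estimate the PSDs with Welch's method
f, S_dd = welch(d, fs=fs, nperseg=1024)
f, S_vv = welch(v, fs=fs, nperseg=1024)

# Step 3: Wiener gain per frequency bin
H = S_dd / (S_dd + S_vv)

# Step 4: apply in the frequency domain (interpolate gain onto FFT bins)
X = np.fft.rfft(x)
f_fft = np.fft.rfftfreq(N, 1 / fs)
H_fft = np.interp(f_fft, f, H)
d_hat = np.fft.irfft(H_fft * X, n=N)

# Step 5: SNR before and after
snr_in = 10 * np.log10(np.sum(d**2) / np.sum(v**2))
snr_out = 10 * np.log10(np.sum(d**2) / np.sum((d_hat - d)**2))
print(snr_in, snr_out)
```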
Interactive: Wiener denoising
Watch how the Wiener filter automatically adjusts the gain at each frequency based on SNR. Drag the SNR slider to see how the filter shape and output quality change.
Applications
- Speech Enhancement: noise cancellation in mobile phone calls. Speech PSD is concentrated in the formant regions between 100 Hz and 4 kHz, while noise PSD is relatively flat. The Wiener filter lets speech through in the formant bands and suppresses noise in the other bands. Modern mobile-phone noise-cancellation chips are essentially improved Wiener filters.
- Astronomical image denoising: images from the Hubble Space Telescope suffer from photon noise and readout noise. Wiener deconvolution handles denoising and deblurring simultaneously.
- Communication equalizers: the MMSE equalizer is a direct application of the Wiener filter—it minimizes the mean-square error of ISI (intersymbol interference) + noise.
- EEG/ERP artifact removal: removing eye-movement, EMG and similar artifacts. Signal and artifact PSDs differ, so the Wiener filter can selectively suppress the artifact bands.
- Active Noise Control (ANC): feedforward/feedback path design in noise-cancelling headphones is essentially solving a causal Wiener filtering problem.
Pitfalls and Limitations
- Requires known PSDs: the Wiener filter needs $S_{dd}$ and $S_{vv}$, but in practice the clean signal is unavailable. Common approach: estimate $S_{vv}$ during silent segments and use $S_{dd} = S_{xx} - S_{vv}$ (which can become negative, requiring half-wave rectification or a spectral floor).
- Assumes WSS: speech, music and similar signals are highly non-stationary. Remedy: use short-time Wiener filtering (re-estimate the PSDs and recompute $H$ on each frame) combined with the STFT framework.
- Musical noise: spectral subtraction (a simplified Wiener filter) produces randomly appearing residual spectral peaks that sound like "whistles" or "water drops". Remedies: smooth the time trajectory of $H$ and set a spectral floor.
- Non-causal: the theoretical Wiener filter is non-causal (uses future samples). The causal version requires spectral factorization and is more complex.
- Only linearly optimal: if the noise is non-Gaussian, a nonlinear estimator may do better. Wiener is optimal among all linear estimators, not all estimators.
Quick Check
Q1: In the Wiener denoising formula $H = S_{dd}/(S_{dd}+S_{vv})$, if the SNR at some frequency is 0 dB (i.e. $S_{dd} = S_{vv}$), what is the value of $H$ there? Does it make intuitive sense?
Show answer
$H = S_{dd}/(S_{dd}+S_{dd}) = 0.5$. The Wiener filter sets the gain to 0.5 (6 dB attenuation). This is intuitive: when signal and noise have equal power they cannot be fully separated, and the optimal strategy is a compromise—pass half the energy, losing some signal but also suppressing half the noise.
Q2: Why is the LMS adaptive filter called an "online approximation" of the Wiener filter? What are its advantages?
Show answer
The Wiener filter needs to know $R_{xx}$ and $r_{xd}$ (statistics) in advance and is an offline one-shot solution. LMS does not need these statistics—it uses the current input and error to estimate the gradient direction and updates the filter coefficients sample by sample. After convergence, the LMS coefficients oscillate around the optimal Wiener solution. LMS advantages: (1) no prior statistics required, (2) it can track slowly time-varying environments.
References: [1] Wiener, N., Extrapolation, Interpolation, and Smoothing of Stationary Time Series, MIT Press, 1949. [2] Kolmogorov, A.N., Interpolation and Extrapolation of Stationary Random Sequences, Izv. Akad. Nauk SSSR, 5:3-14, 1941. [3] Haykin, S., Adaptive Filter Theory, 5th ed., Ch.3. [4] Oppenheim & Schafer, Discrete-Time Signal Processing, 3rd ed., Ch.11. [5] Boll, S., Suppression of Acoustic Noise in Speech Using Spectral Subtraction, IEEE Trans. ASSP, 27(2):113-120, 1979. [6] Loizou, P.C., Speech Enhancement: Theory and Practice, 2nd ed., CRC Press, 2013.
Interactive Lab
Signal Generator + FFT Spectrum Analyzer
Signal Generator
Time-Domain Waveform
FFT Magnitude Spectrum
- Sampling rate: 1024 Hz
- Number of samples: 1024 points
- Frequency resolution: Δf = fs/N = 1024/1024 = 1 Hz
- Maximum analyzable frequency: fs/2 = 512 Hz (Nyquist)
- Signal model: $x[n] = A_1\sin(2\pi f_1 n/f_s) + A_2\sin(2\pi f_2 n/f_s) + \text{noise}$
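A minimal offline version of this signal generator and spectrum analyzer, using the Experiment 1 settings below (the variable names are my own). Since $\Delta f = 1$ Hz, the FFT bin index equals the frequency in Hz:

```python
import numpy as np

# Lab parameters: fs = 1024 Hz, N = 1024, so delta_f = 1 Hz
fs, N = 1024, 1024
n = np.arange(N)
f1, A1 = 100, 1.0          # Experiment 1 settings
f2, A2 = 0, 0.0            # second tone off
noise = 0.0

x = (A1 * np.sin(2 * np.pi * f1 * n / fs)
     + A2 * np.sin(2 * np.pi * f2 * n / fs)
     + noise * np.random.default_rng(7).standard_normal(N))

window = np.hanning(N)     # Hann window, as in Experiment 1
X = np.fft.rfft(x * window)
mag = np.abs(X)

peak_hz = np.argmax(mag) * fs / N   # bin index * delta_f
print(peak_hz)                      # a sharp peak at 100 Hz
```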
🧪 Guided Experiments
Experiment 1: Spectrum of a Single Sine Wave
Settings: f1=100Hz, A1=1.0, turn off f2 (A2=0), noise=0, window=Hann
Expected result: The time domain shows a clean sine wave. The spectrum has a sharp peak at 100Hz with almost no leakage to adjacent bins.
Try this: Switch to the Rectangular window and observe the sidelobes (leakage) appearing around 100Hz.
Experiment 2: Two Closely Spaced Frequencies
Settings: f1=100Hz, f2=108Hz, A1=A2=1.0, noise=0
Expected result: With the Hann window, you should see two separate peaks. Switch to the Rectangular window, and the two peaks may merge (because the wider main lobe causes mutual interference).
Try this: Move f2 closer to 104Hz and observe when the two peaks become indistinguishable.
Experiment 3: Weak Signal in Noise
Settings: f1=100Hz A1=1.0, f2=200Hz A2=0.1, noise=0.5
Expected result: The 100Hz peak is clearly visible, but the 200Hz peak may be buried in noise. Switch to the Blackman window (low sidelobes), and the 200Hz peak should become easier to identify.
Experiment 4: Harmonic Structure of a Square Wave
Settings: (requires changing the waveform to square in the code) f1=50Hz square wave, A1=1.0
Expected result: The spectrum shows peaks at 50, 150, 250, 350... Hz (odd harmonics), with amplitudes decaying as 1/n. This is a practical verification of the Fourier series.
Comprehensive Quiz: 32 Questions on Fourier Analysis and DSP
Covers core concepts from all six parts. Each question tests understanding, not formula memorization.
Question 1: In the Hilbert space $L^2[0, 2\pi]$, why can $\{e^{jn\omega}\}$ serve as a basis?
Question 2: The Dirac delta function δ(t) is not a function in the traditional sense. What mathematical framework does it belong to?
Question 3: What does the Uncertainty Principle tell us?
Question 4: What is the fundamental difference between Fourier Series (FS) and the Fourier Transform (FT)?
Question 5: The sampling theorem requires $f_s > 2f_{\max}$. If a signal has a bandwidth of 100-200 Hz (bandpass signal), what is the minimum sampling rate?
Question 6: DFT length $N = 1024$, sampling rate $f_s = 10$ kHz. What is the frequency resolution? If you need to resolve two frequencies 1 Hz apart, what is the minimum $N$?
Question 7: Regarding the relationship between the Z-transform and DTFT, which of the following is correct?
Question 8: You perform FFT analysis on a single-frequency signal using a rectangular window. If the signal frequency falls exactly between two FFT bins (non-integer bin), what happens?
Question 9: What is the main advantage of the Welch method over directly computing the periodogram? What is the trade-off?
Question 10: The MUSIC algorithm can achieve higher frequency resolution than FFT. Why?
Question 11: What is the primary purpose of the Hilbert transform?
Question 12: In cepstrum analysis, what does "liftering" mean?
Question 13: You need to analyze a 100ms chirp signal (frequency sweeping linearly from 1kHz to 5kHz). What should the STFT window length be?
Question 14: What is the greatest advantage of the Continuous Wavelet Transform (CWT) over STFT?
Question 15: In an OFDM system, what happens if the CP length is shorter than the channel delay spread?
Question 16: In FMCW radar, what does increasing the chirp bandwidth B improve?
Question 17: An engineer observes the 1st, 2nd, 3rd, and 4th harmonics of BPFO clearly in the vibration envelope spectrum. What does this indicate?
Question 18: In array signal processing, what problem arises when the antenna spacing d > λ/2? What is the time-domain analogy?
Question 19: Regarding HRV frequency-domain analysis, which of the following is a common misconception?
Question 20: In OLA (Overlap-Add) fast convolution, what condition must the FFT length N satisfy? What happens if it is not satisfied?
Question 21: What is the necessary and sufficient condition for BIBO stability of an LTI system?
Question 22: For $y[n] = x[n] + a\cdot y[n-1]$, what is $H(z)$?
Question 23: Among classic IIR designs, which has the steepest rolloff for given order?
Question 24: What effect does the bilinear transform $s = (2/T)(z-1)/(z+1)$ introduce?
Question 25: Why is SOS (Cascade) the standard structure for IIR implementation?
Question 26: What happens when LMS step size $\mu$ is too large?
Question 27: What must be done before downsampling a signal by factor M?
Question 28: How much computation does Polyphase save vs direct decimation by M?
Question 29: What is the Perfect Reconstruction (PR) condition for analysis-synthesis filter banks?
Question 30: What is the SNR improvement per octave OSR for L-th order Sigma-Delta?
Question 31: What is the core statement of the Wiener-Khinchin theorem?
Question 32: When signal $d[n]$ and noise $v[n]$ are uncorrelated, the Wiener filter is:
About This Platform
This educational platform is a comprehensive, graduate-level online resource on Fourier Analysis, covering six major parts from mathematical foundations to engineering practice.
Course Structure
| Part | Topic | Sections |
|---|---|---|
| Part I | Mathematical Foundations | 4 sections |
| Part II | Four Core Fourier Transforms + Z-Transform | 6 sections |
| Part III | Spectral Estimation | 4 sections |
| Part IV | Analytic Signals & Cepstrum | 3 sections |
| Part V | Time-Frequency Analysis | 5 sections |
| Part VI | Engineering Practice | 10 sections |
Design Philosophy
- Intuition First: Each concept is first explained in plain language ("why"), then formalized with equations, and finally accompanied by rigorous derivations in expandable `<details>` blocks.
- Problem-Driven: Starting from real-world questions -- "Why does the FFT on an FPGA lack precision?" "Why does the OFDM CP need to be so long?" -- rather than from abstract definitions.
- Engineering Connection: Every theoretical concept is paired with concrete industrial application examples and real numerical parameters.
- Interactive Exploration: Built-in interactive charts and a lab environment allow learners to adjust parameters and observe results firsthand.
Target Audience
- Graduate students in Electrical Engineering / Electronics / Communications / Computer Science
- Signal Processing / DSP Engineers
- Vibration Analysis / Predictive Maintenance Engineers
- Biomedical Engineering / Neuroscience Researchers
- Radar / Communication System Design Engineers
Technical Information
- Language: English. Technical terms include original terminology where appropriate.
- Math Typesetting: Mathematical formulas are marked with the CSS class `fm`, supporting MathJax/KaTeX rendering.
- Interactive Charts: Div containers with the `plot` class, rendered by JavaScript charting libraries (e.g., Plotly.js, Chart.js).
Key References
- Oppenheim, A.V. & Schafer, R.W. Discrete-Time Signal Processing, 3rd Ed., Pearson, 2010.
- Haykin, S. & Van Veen, B. Signals and Systems, 2nd Ed., Wiley, 2003.
- Proakis, J.G. & Manolakis, D.G. Digital Signal Processing, 4th Ed., Pearson, 2007.
- Randall, R.B. Vibration-based Condition Monitoring, Wiley, 2011.
- Goldsmith, A. Wireless Communications, Cambridge University Press, 2005.
- Richards, M.A. Fundamentals of Radar Signal Processing, 2nd Ed., McGraw-Hill, 2014.
- Van Trees, H.L. Optimum Array Processing, Wiley, 2002.
- Mallat, S. A Wavelet Tour of Signal Processing, 3rd Ed., Academic Press, 2009.
- Task Force of ESC/NASPE. "Heart rate variability: Standards of measurement," Circulation, 93(5), 1996.
- 3GPP TS 38.211, "NR; Physical channels and modulation," Release 17.
📓 Python Examples & Exercises
All Python code on this platform can be copied directly into your Jupyter Notebook and run. Recommended environment: Python 3 with NumPy, SciPy, Matplotlib, and Jupyter installed.
Recommended study workflow:
- Read through each chapter's theory and examples
- Copy the Python code into Jupyter, run it, and observe the output
- Modify parameters and watch how the behavior changes (try extreme values!)
- Download data from the Real-World Datasets on the home page to replace the synthetic signals
- Record your observations and open questions