Median and MAD Revisited with an Online Estimator

07 Apr 2020

In a previous post a faster Selection algorithm was demonstrated. This will be revisited focusing on the Median and MAD statistical measures and their practical use in performance testing.

Though the Median is robust to outliers and a very useful measure, one downside is it requires all the sample values to be kept for the calculation. For online algorithms as the sample size increases this creates a memory problem. An online algorithm will be provided to estimate the Median and MAD in fixed memory.

Exact Median Algorithm

This algorithm is based on the MODIFIND algorithm by Vladimir Zabrodsky. In the previous post it was shown to behave well with partially sorted and duplicate data.

module Statistics =
    let medianInPlace (a:float[]) (index:int) (length:int) =
        let inline swap (a:float[]) i j =
            let temp = a.[i]
            a.[i]<-a.[j]
            a.[j]<-temp
        let inline swapIf (a:float[]) i j =
            let ai = a.[i]
            let aj = a.[j]
            if ai>aj then
                a.[i]<-aj
                a.[j]<-ai
        let k = length >>> 1
        let rec outerLoop lo hi =
            swapIf a lo k
            swapIf a lo hi
            swapIf a k hi
            let pivot = a.[k]
            let resetLo = if a.[lo]=pivot then a.[lo]<-Double.MinValue; true else false
            let resetHi = if a.[hi]=pivot then a.[hi]<-Double.MaxValue; true else false
            let mutable i = lo+1
            let mutable j = hi-1
            while a.[i]<=pivot do i<-i+1
            while a.[j]>=pivot do j<-j-1
            while i<k && k<j do
                swap a i j
                i<-i+1
                j<-j-1
                while a.[i]<=pivot do i<-i+1
                while a.[j]>=pivot do j<-j-1
            if i<j then
                swap a i j
                if k<j then
                    i<-lo
                    j<-j-1
                    while a.[j]>=pivot do j<-j-1
                else
                    j<-hi
                    i<-i+1
                    while a.[i]<=pivot do i<-i+1
            else
                if k<j then i<-lo
                elif k>i then j<-hi
            if resetLo then a.[lo]<-pivot
            if resetHi then a.[hi]<-pivot
            if i>=j then
                if length &&& 1 = 0 && j+1=k then
                    let mutable v = a.[lo]
                    for i = lo+1 to j do
                        if a.[i]>v then v<-a.[i]
                    (v + pivot) * 0.5
                else pivot
            else outerLoop i j
        outerLoop index (index+length-1)

Performance Testing

The algorithm is compared to a full sort, the Math.Net Quickselect and the Perfolizer QuickSelectAdaptive algorithms.

This performance testing technique created as part of a new testing library prototype uses a statistical test on the counts of the faster of two algorithms. The Median and MAD are estimated as useful information. As well as being robust to outliers they are also a good compliment since the faster algorithm will always have a positive Median performance improvement. This may not be true of the Mean.

Performance testing can be run across all threads and also while running other tests. It is also able to compare a range of input data instead of just a single data input.

It reaches statistically significate results extremely quickly compared to tests based on the Mean. A sigma of 6 gives a good stopping criteria and is used when skipping completed tests.

Online Estimator

In a previous post online statistical calculations were discussed. The following online Median and MAD estimator is a hybrid of an exact Median and MAD calculation followed by a recursive estimator. The size of the exact sample N and a learning rate Eta are taken as a parameters.

The standard error of the sample Median approximates to:

\[SE \sim \frac{MAD}{\sqrt{n}}\]

A fixed fraction Eta of the MAD for the learning rate is used as it provides a parameter with a sensible scale.

type MedianEstimator =
    val mutable private A : float array
    val mutable private Median : float
    val mutable private MAD : float
    val private Eta : float
    val mutable N : int
    new(n:int,eta:float) =
        {A = Array.zeroCreate n; Median = 0.0; MAD = 0.0; Eta = eta; N = 0}
    member m.MedianAndMAD =
        if isNull m.A then m.Median, m.MAD
        else
            let median = Statistics.medianInPlace m.A 0 m.N
            let mad = Array.sub m.A 0 m.N
            for i = 0 to mad.Length-1 do
                mad.[i] <- abs(mad.[i]-median)
            median, Statistics.medianInPlace mad 0 mad.Length
    member m.Add (s:float) =
        if isNull m.A then
            m.Median <- m.Median +  m.MAD * m.Eta * float(sign(s-m.Median))
            m.MAD <- m.MAD +  m.MAD * m.Eta * float(sign(abs(s-m.Median)-m.MAD))
            m.N <- m.N + 1
        else
            m.A.[m.N] <- s
            m.N <- m.N + 1
            if m.N = m.A.Length then
                let median, mad = m.MedianAndMAD
                m.Median <- median
                m.MAD <- mad
                m.A <- null

For performance testing N=99 and F=0.001 give stable results.

Conclusion

The superior performance of the MODIFIND algorithm has again been demonstrated compared to current algorithms.

A new type of statistical performance testing has been introduced with useful features of being able to run in parallel and across a range of inputs efficiently.

A fixed memory online Median and MAD estimator has been created to provide efficient and consistent statistics for use in performance testing and other performance critical applications.

Although early days these techniques together look to provide robust and efficient results.

val medianInPlace : a:float [] -> index:int -> length:int -> float

val a : float []

Multiple items
val float : value:'T -> float (requires member op_Explicit)

--------------------
type float = System.Double

--------------------
type float<'Measure> = float

val index : int

Multiple items
val int : value:'T -> int (requires member op_Explicit)

--------------------
type int = int32

--------------------
type int<'Measure> = int

val length : int

val swap : (float [] -> int -> int -> unit)

val i : int

val j : int

val temp : float

val swapIf : (float [] -> int -> int -> unit)

val ai : float

val aj : float

val k : int

val outerLoop : (int -> int -> float)

val lo : int

val hi : int

val pivot : float

val resetLo : bool

val resetHi : bool

val mutable i : int

val mutable j : int

val mutable v : float

MedianEstimator.A: float array

type 'T array = 'T []

MedianEstimator.Median: float

MedianEstimator.MAD: float

MedianEstimator.Eta: float

MedianEstimator.N: int

val n : int

val eta : float

module Array

from Microsoft.FSharp.Collections

val zeroCreate : count:int -> 'T []

val m : MedianEstimator

val isNull : value:'T -> bool (requires 'T : null)

val median : float

module Statistics

from 2020-04-07-Median-Revisited

val mad : float []

val sub : array:'T [] -> startIndex:int -> count:int -> 'T []

property System.Array.Length: int

val abs : value:'T -> 'T (requires member Abs)

val s : float

val sign : value:'T -> int (requires member get_Sign)

val mad : float

property MedianEstimator.MedianAndMAD: float * float

« Integrated Random Testing Bitemporal Source Generator Demo »

Anthony Lloyd

Median and MAD Revisited with an Online Estimator

Exact Median Algorithm

Performance Testing

Online Estimator

Conclusion