Get MAD with Outliers with an Improved Median Function
21 Oct 2016This post presents a more robust method of detecting outliers in sample data than commonly used. The method is based on the median and an optimised F# median function is developed.
Background
Researchers most commonly use standard deviation around the mean to detect outliers. This is a problem because the mean and standard deviation give greater weight to extreme data points.
The mean is the point that minimises the sum of the square deviations, whereas the median is the point that minimises the sum of the absolute deviations. The median and median absolute deviation give a more robust measure of statistical dispersion and are more resilient to outliers.
When a politician says average wages are increasing be sure to check the median is being reported.
Median Absolute Deviation
The median is the value separating the higher half of the sample from the lower half.
The median absolute deviation is defined as
\[\operatorname{MAD} = \operatorname{median}\left(\left|x_i - \operatorname{median}(x_i)\right|\right)\]
Outliers can be identified as points that are outside a fixed multiple of the median absolute deviation from the median. Recommended values for this multiple are 2.5 or 3.
Median and MAD code
The following median function makes use of the MODIFIND algorithm by Vladimir Zabrodsky. It provides a 20-30% performance improvement over the Quickselect algorithm.
The array while loops allow equality which improves performance when there is duplication and ordering in the data.
The selectInPlace
function has also been extended to optionally return the middle of the kth and the k+1 element.
module Statistics =
/// Returns the median of three input values.
let inline median3 a b c =
if a<=b then
if b<=c then b
elif c<=a then a
else c
else
if a<=c then a
elif c<=b then b
else c
/// Returns the minimum value of a subsection of an array.
let inline minSub (a:_[]) lo hi =
let mutable v = a.[lo]
for i = lo+1 to hi do
let nv = a.[i]
if nv<v then v<-nv
v
/// Returns the maximum value of a subsection of an array.
let inline maxSub (a:_[]) lo hi =
let mutable v = a.[lo]
for i = lo+1 to hi do
let nv = a.[i]
if nv>v then v<-nv
v
/// Returns the middle point of the two smallest values of a subsection of an array.
let inline min2middleSub (a:_[]) lo hi =
let mutable v0 = a.[lo]
let mutable v1 = a.[lo+1]
if v1<v0 then
let tmp = v0
v0<-v1
v1<-tmp
for i = lo+2 to hi do
let nv = a.[i]
if nv<v0 then
v1<-v0
v0<-nv
elif nv<v1 then
v1<-nv
middleOrdered v0 v1
/// Returns the middle point of the two largest values of a subsection of an array.
let inline max2middleSub (a:_[]) lo hi =
let mutable v0 = a.[lo]
let mutable v1 = a.[lo+1]
if v1>v0 then
let tmp = v0
v0<-v1
v1<-tmp
for i = lo+2 to hi do
let nv = a.[i]
if nv>v0 then
v1<-v0
v0<-nv
elif nv>v1 then
v1<-nv
middleOrdered v1 v0
/// Swap two elements in an array.
let inline swap (a:_[]) i j =
let temp = a.[i]
a.[i]<-a.[j]
a.[j]<-temp
/// Swap two elements in an array if the first is larger than the second.
let inline swapIf (a:_[]) i j =
let ai = a.[i]
let aj = a.[j]
if ai>aj then
a.[i]<-aj
a.[j]<-ai
/// Returns the kth smallest element in an array and optionally the middle with
/// the next largest. Elements will be reordered in place and cannot be equal
/// to the max or min value of the generic type.
let inline selectInPlace (a:_[]) k middleNext =
let rec outerLoop lo hi =
swapIf a lo k
swapIf a lo hi
swapIf a k hi
let pivot = a.[k]
let resetLo = if a.[lo]=pivot then a.[lo]<-minValue(); true else false
let resetHi = if a.[hi]=pivot then a.[hi]<-maxValue(); true else false
let mutable i = lo+1
let mutable j = hi-1
while a.[i]<=pivot do i<-i+1
while a.[j]>=pivot do j<-j-1
while i<k && k<j do
swap a i j
i<-i+1
j<-j-1
while a.[i]<=pivot do i<-i+1
while a.[j]>=pivot do j<-j-1
if i<j then
swap a i j
if k<j then
i<-lo
j<-j-1
while a.[j]>=pivot do j<-j-1
else
j<-hi
i<-i+1
while a.[i]<=pivot do i<-i+1
else
if k<j then i<-lo
elif k>i then j<-hi
if resetLo then a.[lo]<-pivot
if resetHi then a.[hi]<-pivot
if i>=j then if middleNext then if k+1=i then
minSub a i hi |> middleOrdered pivot
else pivot
else pivot
elif k=i then if middleNext then min2middleSub a i j else minSub a i j
elif k=j then if middleNext then max2middleSub a i j else maxSub a i j
else outerLoop i j
outerLoop 0 (Array.length a-1)
/// Returns the median of an array. Elements will be reordered in place and
/// cannot be equal to the max or min value of the generic type.
let inline medianInPlace (a:_[]) =
match Array.length a-1 with
| 0 -> a.[0]
| 1 -> middle a.[0] a.[1]
| 2 -> median3 a.[0] a.[1] a.[2]
| last -> selectInPlace a (last/2) (last%2=1)
/// Returns the median and median absolute deviation of an array.
/// Elements cannot be equal to the max or min value of the generic type.
let inline medianAndMAD (a:_[]) =
let a = Array.copy a
let median = medianInPlace a
for i = 0 to Array.length a-1 do
a.[i]<-abs(a.[i]-median)
median,medianInPlace a
Property and Performance testing
A simple FsCheck
property test comparing the result with a full sort version ensures no mistakes have been made in the implementation.
The performance versus a full sort algorithm and the Math.Net C# Quickselect implementation are compared for different degrees of duplication and sorting.
The performance testing library developed in a previous post is used after extending it to allow subfunction measurement. This is run from the build script in 64-bit Release mode.
module StatisticsTests =
let inline medianQuickSelect (a:float[]) =
MathNet.Numerics.Statistics.ArrayStatistics.MedianInplace a
let inline medianFullSort a =
Array.sortInPlace a
let l = Array.length a
if l%2=0 then
let i = l/2
let x = a.[i-1]
let y = a.[i]
x+(y-x)*0.5
else a.[l/2]
[<Property>]
let MedianProp (x:int) (xs:int list) =
let l = x::xs |> List.map float
let m1 = List.toArray l |> Statistics.medianInPlace
let m2 = List.toArray l |> medianFullSort
m1=m2
type Duplication =
| Low | Medium | High
member i.ToInt = match i with | Low->500000000 | Medium->5000 | High->50
override i.ToString() =
match i with | Low->"Low" | Medium->"Medium" | High->"High"
type Sorted =
| No | Part | Yes
override i.ToString() = match i with | No->"No" | Part->"Part" | Yes->"Yes"
let list (duplication:Duplication) (sorted:Sorted) =
let r = System.Random 123
let next() = r.Next(0,duplication.ToInt) |> float
Seq.init 5000 (fun i ->
let l = List.init (i+1) (fun _ -> next())
match sorted with
| No -> l
| Yes -> List.sort l
| Part ->
let a = List.sort l |> List.toArray
let len = Array.length a
Seq.iter (fun _ -> Statistics.swap a (r.Next len) (r.Next len))
{1..len/4}
List.ofArray a
)
[<Fact>]
let MedianPerfTest() =
"| Duplication | Sorted | Current | MathNet | FullSort | 1.000 = |"
|> printfn
"|:-----------:|:----------:|:---------:|:---------:|:----------:|:---------:|"
|> printfn
Seq.collect (fun d -> Seq.map (fun s -> (d,s),list d s) [No;Part;Yes])
[Low;Medium;High]
|> Seq.iter (fun ((d,s),lists) ->
let timeStatistics f =
Performance.timeStatistics
(fun timer -> Seq.iter (List.toArray >> timer f >> ignore) lists)
let p1 = timeStatistics Statistics.medianInPlace
let p2 = timeStatistics medianQuickSelect
let p3 = timeStatistics medianFullSort
printfn
"| %-6s | %-4s | 1.000 | %6.3f | %6.3f | %.4fs |"
(string d) (string s) (p2.Mean/p1.Mean) (p3.Mean/p1.Mean) p1.Mean
)
Duplication |
Sorted |
Current |
MathNet |
FullSort |
1.000 = |
---|---|---|---|---|---|
Low |
No |
1.000 |
1.369 |
6.582 |
0.1028s |
Low |
Part |
1.000 |
1.350 |
9.269 |
0.0476s |
Low |
Yes |
1.000 |
1.516 |
12.964 |
0.0106s |
Medium |
No |
1.000 |
1.373 |
6.577 |
0.1018s |
Medium |
Part |
1.000 |
1.397 |
9.471 |
0.0478s |
Medium |
Yes |
1.000 |
1.840 |
17.519 |
0.0100s |
High |
No |
1.000 |
1.341 |
4.745 |
0.1059s |
High |
Part |
1.000 |
1.576 |
8.050 |
0.0526s |
High |
Yes |
1.000 |
3.193 |
26.390 |
0.0087s |
Conclusion
Optimised generic select, median and median absolute deviation functions have been developed.
The performance results show a good improvement over Quickselect which is already an optimised algorithm. The performance of the code is also more predictable due to its handling of duplication and partially sorted data.
This also demonstrates how property based testing and a performance testing library can be used together to optimise algorithms.
type AutoOpenAttribute =
inherit Attribute
new : unit -> AutoOpenAttribute
new : path:string -> AutoOpenAttribute
member Path : string
--------------------
new : unit -> AutoOpenAttribute
new : path:string -> AutoOpenAttribute
union case MinValueGen.MinValueGen: MinValueGen
--------------------
type MinValueGen =
| MinValueGen
static member ( => ) : MinValueGen * int -> int
static member ( => ) : MinValueGen * int64 -> int64
static member ( => ) : MinValueGen * float -> float
static member ( => ) : MinValueGen * MinValueGen -> 'a
val int : value:'T -> int (requires member op_Explicit)
--------------------
type int = int32
--------------------
type int<'Measure> = int
struct
member CompareTo : value:obj -> int + 1 overload
member Equals : obj:obj -> bool + 1 overload
member GetHashCode : unit -> int
member GetTypeCode : unit -> TypeCode
member ToString : unit -> string + 3 overloads
static val MaxValue : int
static val MinValue : int
static member Parse : s:string -> int + 3 overloads
static member TryParse : s:string * result:int -> bool + 1 overload
end
val int64 : value:'T -> int64 (requires member op_Explicit)
--------------------
type int64 = Int64
--------------------
type int64<'Measure> = int64
struct
member CompareTo : value:obj -> int + 1 overload
member Equals : obj:obj -> bool + 1 overload
member GetHashCode : unit -> int
member GetTypeCode : unit -> TypeCode
member ToString : unit -> string + 3 overloads
static val MaxValue : int64
static val MinValue : int64
static member Parse : s:string -> int64 + 3 overloads
static member TryParse : s:string * result:int64 -> bool + 1 overload
end
val float : value:'T -> float (requires member op_Explicit)
--------------------
type float = Double
--------------------
type float<'Measure> = float
struct
member CompareTo : value:obj -> int + 1 overload
member Equals : obj:obj -> bool + 1 overload
member GetHashCode : unit -> int
member GetTypeCode : unit -> TypeCode
member ToString : unit -> string + 3 overloads
static val MinValue : float
static val MaxValue : float
static val Epsilon : float
static val NegativeInfinity : float
static val PositiveInfinity : float
...
end
Returns the minimum value for a generic type.
from Microsoft.FSharp.Core
union case MaxValueGen.MaxValueGen: MaxValueGen
--------------------
type MaxValueGen =
| MaxValueGen
static member ( => ) : MaxValueGen * int -> int
static member ( => ) : MaxValueGen * int64 -> int64
static member ( => ) : MaxValueGen * float -> float
static member ( => ) : MaxValueGen * MaxValueGen -> 'a
Returns the maximum value for a generic type.
union case HalfPositiveGen.HalfPositiveGen: HalfPositiveGen
--------------------
type HalfPositiveGen =
| HalfPositiveGen
static member ( => ) : HalfPositiveGen * x:int -> int
static member ( => ) : HalfPositiveGen * x:int64 -> int64
static member ( => ) : HalfPositiveGen * x:float -> float
static member ( => ) : HalfPositiveGen * HalfPositiveGen -> 'a
Returns half of the positive generic input value.
Returns the middle point of two ordered input values.
Returns the middle point of two input values.
{T: float;
DF: int;}
Online statistics sequence for a given sample sequence.
from Microsoft.FSharp.Collections
Scale the statistics for a given underlying random variable change of scale.
Single iteration statistics for a given iteration count and total statistics.
Student's t-distribution inverse for 0.1% confidence level by degrees of freedom.
Welch's t-test statistic for two given sample statistics.
Welch's t-test for a given Welch statistic to a confidence level of 0.1%.
member Clone : unit -> obj
member CopyTo : array:Array * index:int -> unit + 1 overload
member GetEnumerator : unit -> IEnumerator
member GetLength : dimension:int -> int
member GetLongLength : dimension:int -> int64
member GetLowerBound : dimension:int -> int
member GetUpperBound : dimension:int -> int
member GetValue : [<ParamArray>] indices:int[] -> obj + 7 overloads
member Initialize : unit -> unit
member IsFixedSize : bool
...
from Main
Find the iteration count to get to at least the metric target.
Create and iterate a statistics sequence for a metric until the given accuracy.
Create and iterate two statistics sequences until the metric means can be compared.
Measure the given function for the iteration count using the start and end metric.
static member AddMemoryPressure : bytesAllocated:int64 -> unit
static member CancelFullGCNotification : unit -> unit
static member Collect : unit -> unit + 4 overloads
static member CollectionCount : generation:int -> int
static member EndNoGCRegion : unit -> unit
static member GetGeneration : obj:obj -> int + 1 overload
static member GetTotalMemory : forceFullCollection:bool -> int64
static member KeepAlive : obj:obj -> unit
static member MaxGeneration : int
static member ReRegisterForFinalize : obj:obj -> unit
...
GC.Collect(generation: int) : unit
GC.Collect(generation: int, mode: GCCollectionMode) : unit
GC.Collect(generation: int, mode: GCCollectionMode, blocking: bool) : unit
GC.Collect(generation: int, mode: GCCollectionMode, blocking: bool, compacting: bool) : unit
Measure the time metric for the given function and iteration count.
type Stopwatch =
new : unit -> Stopwatch
member Elapsed : TimeSpan
member ElapsedMilliseconds : int64
member ElapsedTicks : int64
member IsRunning : bool
member Reset : unit -> unit
member Restart : unit -> unit
member Start : unit -> unit
member Stop : unit -> unit
static val Frequency : int64
...
--------------------
Stopwatch() : Stopwatch
Measure the memory metric for the given function and iteration count.
GC.TryStartNoGCRegion(totalSize: int64, disallowFullBlockingGC: bool) : bool
GC.TryStartNoGCRegion(totalSize: int64, lohSize: int64) : bool
GC.TryStartNoGCRegion(totalSize: int64, lohSize: int64, disallowFullBlockingGC: bool) : bool
Measure the garbage collection metric for the given function and iteration count.
Measure definitions which are a metric together with a metric target.
Time statistics for a given function accurate to a mean standard error of 1%.
Memory statistics for a given function accurate to a mean standard error of 1%.
GC count statistics for a given function accurate to a mean standard error of 1%.
Time comparison for two given functions to a confidence level of 0.1%.
Memory comparison for two given functions to a confidence level of 0.1%.
GC count comparison for two given functions to a confidence level of 0.1%.
Returns the median of three input values.
Returns the minimum value of a subsection of an array.
Returns the maximum value of a subsection of an array.
Returns the middle point of the two smallest values of a subsection of an array.
Returns the middle point of the two largest values of a subsection of an array.
Swap two elements in an array.
Swap two elements in an array if the first is larger than the second.
Returns the kth smallest element in an array and optionally the middle with
the next largest. Elements will be reordered in place and cannot be equal
to the max or min value of the generic type.
Returns the median of an array. Elements will be reordered in place and
cannot be equal to the max or min value of the generic type.
Returns the median and median absolute deviation of an array.
Elements cannot be equal to the max or min value of the generic type.
from Main
module List
from Microsoft.FSharp.Collections
--------------------
type List<'T> =
| ( [] )
| ( :: ) of Head: 'T * Tail: 'T list
interface IReadOnlyList<'T>
interface IReadOnlyCollection<'T>
interface IEnumerable
interface IEnumerable<'T>
member GetSlice : startIndex:int option * endIndex:int option -> 'T list
member Head : 'T
member IsEmpty : bool
member Item : index:int -> 'T with get
member Length : int
member Tail : 'T list
...
from Main
val list : duplication:Duplication -> sorted:Sorted -> seq<float list>
--------------------
type 'T list = List<'T>
| Low
| Medium
| High
override ToString : unit -> string
member ToInt : int
| No
| Part
| Yes
override ToString : unit -> string
type Random =
new : unit -> Random + 1 overload
member Next : unit -> int + 2 overloads
member NextBytes : buffer:byte[] -> unit
member NextDouble : unit -> float
--------------------
Random() : Random
Random(Seed: int) : Random
Random.Next(maxValue: int) : int
Random.Next(minValue: int, maxValue: int) : int
from Main
val string : value:'T -> string
--------------------
type string = String