percentile
Computes the specified percentile of values.
This is also called the "centile" or the 100-quantile.
The 25th percentile is also known as the first quartile (Q1), the 50th percentile as the median or second quartile (Q2), and the 75th percentile as the third quartile (Q3).
`null` values in the input are ignored. The operations either throw an exception when the input is empty (after filtering out `null` or `NaN` values), or return `null` when using the `-orNull` overloads.
All primitive numeric types are supported: `Byte`, `Short`, `Int`, `Long`, `Float`, and `Double`, but mixing different number types is not supported. In these cases, the return type is always `Double?`. The results of the operation on these types are interpolated using Quantile Estimation Method R8.
The operation is also available for self-comparable columns (that is, columns of type `T : Comparable<T>`, like `DateTime`, `String`, etc.). In this case, the return type remains `T?`. The index of the result of the operation on these types is rounded using Quantile Estimation Method R3.
All operations on `Double`/`Float` have the `skipNaN` option, which is set to `false` by default. This means that if a `NaN` is present in the input, it will be propagated to the result. When it's set to `true`, `NaN` values are ignored.
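The `null` and `NaN` rules above can be sketched in plain Kotlin. This is a minimal illustration under stated assumptions, not DataFrame's implementation: `percentileOrNullSketch`, `percentileSketch`, and the `estimate` parameter are hypothetical names, and the exception type is an assumption, since the text does not name it.

```kotlin
// Hypothetical sketch of the null/NaN rules described above; not DataFrame's code.
// `estimate` stands in for the real quantile estimator and only sees clean values.
fun percentileOrNullSketch(
    values: List<Double?>,
    skipNaN: Boolean = false,
    estimate: (List<Double>) -> Double,
): Double? {
    val nonNull = values.filterNotNull() // null values are always ignored
    // With skipNaN = false, a single NaN is propagated to the result.
    if (!skipNaN && nonNull.any { it.isNaN() }) return Double.NaN
    val clean = nonNull.filterNot { it.isNaN() }
    // Empty input after filtering: the -orNull variant returns null.
    if (clean.isEmpty()) return null
    return estimate(clean)
}

// The non-null variant throws instead (exception type is an assumption).
fun percentileSketch(
    values: List<Double?>,
    skipNaN: Boolean = false,
    estimate: (List<Double>) -> Double,
): Double =
    percentileOrNullSketch(values, skipNaN, estimate)
        ?: throw NoSuchElementException("input is empty after filtering")

fun main() {
    val input = listOf(3.0, null, 1.0, Double.NaN, 2.0)
    // Placeholder estimator: picks the middle element of the sorted values.
    val middle: (List<Double>) -> Double = { it.sorted()[it.size / 2] }

    println(percentileOrNullSketch(input, estimate = middle))                 // NaN
    println(percentileOrNullSketch(input, skipNaN = true, estimate = middle)) // 2.0
    println(percentileOrNullSketch(listOf(null, Double.NaN), skipNaN = true, estimate = middle)) // null
}
```

Passing the estimator as a parameter keeps the sketch focused on input filtering, independent of which quantile estimation method is applied afterwards.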
DataFrame follows Hyndman, R. J. and Fan, Y. (1996), "Sample Quantiles in Statistical Packages", The American Statistician, 50, 361-365, doi:10.1080/00031305.1996.10473566, and Apache Commons Statistics for the 9 commonly used quantile estimation methods.
For the `percentile` operation, DataFrame uses estimation method R3 when the given percentile needs to be selected from the values (as for self-comparable columns), and R8 when the given percentile can be interpolated from the values (of a numeric column). R8 is the method recommended by Hyndman and Fan, though other libraries, such as NumPy, default to R7, so slightly different results are to be expected.
In the future we might add an option to change the quantile estimation method.
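To make the difference between the methods concrete, here is a minimal plain-Kotlin sketch of R3, R7, and R8 following the Hyndman and Fan definitions. It is not DataFrame's implementation: the function names are hypothetical, and the percentile is assumed to be given in the range 0.0 to 100.0.

```kotlin
import kotlin.math.floor
import kotlin.math.round

// Linear interpolation at the 1-based fractional position h in an ascending list.
private fun interpolateAt(sorted: List<Double>, h: Double): Double {
    val n = sorted.size
    val clamped = h.coerceIn(1.0, n.toDouble())
    val j = floor(clamped).toInt()
    val g = clamped - j
    val lower = sorted[j - 1]
    val upper = sorted[minOf(j, n - 1)]
    return lower + g * (upper - lower)
}

// R8: position h = (n + 1/3) * p + 1/3.
fun percentileR8(values: List<Double>, percentile: Double): Double {
    require(values.isNotEmpty()) { "input must not be empty" }
    val p = percentile / 100.0
    return interpolateAt(values.sorted(), (values.size + 1.0 / 3.0) * p + 1.0 / 3.0)
}

// R7 (NumPy's default): position h = (n - 1) * p + 1.
fun percentileR7(values: List<Double>, percentile: Double): Double {
    require(values.isNotEmpty()) { "input must not be empty" }
    val p = percentile / 100.0
    return interpolateAt(values.sorted(), (values.size - 1) * p + 1.0)
}

// R3: no interpolation; picks the element at n * p rounded to the nearest
// integer, with ties going to the even index (kotlin.math.round does this).
fun <T : Comparable<T>> percentileR3(values: List<T>, percentile: Double): T {
    require(values.isNotEmpty()) { "input must not be empty" }
    val sorted = values.sorted()
    val index = round(sorted.size * percentile / 100.0).toInt().coerceIn(1, sorted.size)
    return sorted[index - 1]
}

fun main() {
    val numbers = listOf(1.0, 2.0, 3.0, 4.0, 5.0)
    println(percentileR8(numbers, 25.0)) // approximately 1.667
    println(percentileR7(numbers, 25.0)) // 2.0
    println(percentileR3(listOf("a", "b", "c", "d", "e"), 25.0)) // a
}
```

For the same five values, R8 and R7 place the 25th percentile at different positions, which is why results can differ slightly from NumPy. R3 always returns one of the input values, so it also works for non-numeric comparable types.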
```kotlin
df.percentile(25.0) // 25th percentile of values per every comparable column
df.percentile(75.0) { age and weight } // 75th percentile of all values in `age` and `weight`
df.percentileFor(50.0, skipNaN = true) { age and weight } // 50th percentile of values per `age` and `weight` separately
df.percentileOf(75.0) { (weight ?: 0) / age } // 75th percentile of expression evaluated for every row
df.percentileBy(25.0) { age } // DataRow where the 25th percentile of `age` lies (index rounded using R3)
```
```kotlin
df.percentile(25.0)
df.age.percentile(75.0)
df.groupBy { city }.percentile(50.0)
df.pivot { city }.percentile(75.0)
df.pivot { city }.groupBy { name.lastName }.percentile(25.0)
```
See statistics for details on complex data aggregations.
See column selectors for how to select the columns for this operation.
The following automatic type conversions are performed for the `percentile` operation. (Note that `null` only appears in the return type when using the `-orNull` overloads.)
Conversion | Result for Empty Input |
---|---|
T -> T where T : Comparable\<T\> | null |
Int -> Double | null |
Byte -> Double | null |
Short -> Double | null |
Long -> Double | null |
Double -> Double | null |
Float -> Double | null |
Nothing -> Nothing | null |