Entropy #24

Merged
merged 51 commits on Mar 10, 2019

Changes from 15 commits

Commits (51)
76986a9
Add entropy trait
LukeMathWalker Jan 22, 2019
1395859
Remove unnecessary imports
LukeMathWalker Jan 22, 2019
b741b3d
Implemented entropy
LukeMathWalker Jan 22, 2019
7b4f6d5
Return entropy with reversed sign, as per definition
LukeMathWalker Jan 22, 2019
cc221e2
Fixed tests
LukeMathWalker Jan 22, 2019
7473815
Added signature for cross entropy
LukeMathWalker Jan 23, 2019
ea3e81f
Fixed typo.
LukeMathWalker Jan 23, 2019
2998395
Implemented cross_entropy
LukeMathWalker Jan 27, 2019
bfc3d22
Added tests
LukeMathWalker Jan 27, 2019
a69ed91
Refined panic condition
LukeMathWalker Jan 27, 2019
3d7929b
Added test vs SciPy
LukeMathWalker Jan 27, 2019
27dbd00
Added test vs SciPy
LukeMathWalker Jan 27, 2019
873871b
Added KL divergence
LukeMathWalker Jan 27, 2019
dc85e9a
Added KL tests
LukeMathWalker Jan 27, 2019
ca95788
Renamed to kl_divergence
LukeMathWalker Jan 27, 2019
d21a0bb
Update src/entropy.rs
jturner314 Feb 25, 2019
c127428
Improved docs on behaviour with not normalised arrays
LukeMathWalker Feb 26, 2019
0106d65
Improved docs on behaviour with not normalised arrays
LukeMathWalker Feb 26, 2019
b28f461
Use mapv
LukeMathWalker Feb 26, 2019
ddf358b
Styling on closures (avoid dereferencing)
LukeMathWalker Feb 26, 2019
afdcf06
Allow different data ownership to interact in kl_divergence
LukeMathWalker Feb 26, 2019
28b4efd
Allow different data ownership to interact in kl_divergence
LukeMathWalker Feb 26, 2019
8c04f9c
Allow different data ownership to interact in cross_entropy
LukeMathWalker Feb 26, 2019
450cfb4
Add a test
LukeMathWalker Feb 26, 2019
5d45bdf
Doc improvement
LukeMathWalker Feb 26, 2019
5c72f55
Check the whole shape
LukeMathWalker Feb 26, 2019
168ffa5
Merge remote-tracking branch 'origin/entropy' into entropy
LukeMathWalker Feb 26, 2019
bb38763
Fix docs
LukeMathWalker Feb 26, 2019
c470a3a
Broken usage of Zip
LukeMathWalker Feb 26, 2019
e4be9b9
Fixed zip, mistery
LukeMathWalker Feb 26, 2019
57537c3
Use Zip for cross_entropy
LukeMathWalker Feb 26, 2019
80198bc
Add failure crate as dependency
Mar 8, 2019
93371f8
Errors module
Mar 8, 2019
5f6a004
Use failure crate
Mar 8, 2019
42c3600
Add ShapeMismatch error
Mar 8, 2019
02a63de
Merge branch 'entropy' of github.com:LukeMathWalker/ndarray-stats int…
Mar 8, 2019
05d5c66
Return Result
Mar 8, 2019
99a391e
Fix test suite
Mar 8, 2019
3a3d1f6
Fix docs
Mar 8, 2019
ca31af8
Fix docs
Mar 8, 2019
e65ef61
Add docs to error
Mar 8, 2019
e39025c
Update src/entropy.rs
jturner314 Mar 10, 2019
99b999f
Update src/entropy.rs
jturner314 Mar 10, 2019
ac4c159
Update src/entropy.rs
jturner314 Mar 10, 2019
b429ec7
Better semantic
LukeMathWalker Mar 10, 2019
e9679fa
Use Error instead of Fail
LukeMathWalker Mar 10, 2019
d2dfe8f
Formatting
LukeMathWalker Mar 10, 2019
d8583d2
Merge branch 'master' into entropy
LukeMathWalker Mar 10, 2019
b8ed3ed
Fix TOC
LukeMathWalker Mar 10, 2019
c961d9f
Module docstring
LukeMathWalker Mar 10, 2019
e7ec4b4
Merge branch 'entropy' of github.com:LukeMathWalker/ndarray-stats int…
LukeMathWalker Mar 10, 2019
339 changes: 339 additions & 0 deletions src/entropy.rs
@@ -0,0 +1,339 @@
//! Information theory measures (e.g. entropy, Kullback–Leibler divergence, cross entropy).
use ndarray::{Array1, ArrayBase, Data, Dimension};
use num_traits::Float;

/// Extension trait for `ArrayBase` providing methods
/// to compute information theory quantities
/// (e.g. entropy, Kullback–Leibler divergence, etc.).
pub trait EntropyExt<A, S, D>
where
S: Data<Elem = A>,
D: Dimension,
{
/// Computes the [entropy] *S* of the array values, defined as
///
/// ```text
/// n
/// S = - ∑ xᵢ ln(xᵢ)
/// i=1
/// ```
///
/// If the array is empty, `None` is returned.
///
/// **Panics** if any element in the array is negative.
///
/// ## Remarks
///
/// The entropy is a measure used in [Information Theory]
/// to describe a probability distribution: it only makes sense
/// when the array values sum to 1, with each entry between
/// 0 and 1 (extremes included).
///
/// By definition, *xᵢ ln(xᵢ)* is set to 0 if *xᵢ* is 0.
///
/// [entropy]: https://en.wikipedia.org/wiki/Entropy_(information_theory)
/// [Information Theory]: https://en.wikipedia.org/wiki/Information_theory
fn entropy(&self) -> Option<A>
where
A: Float;

/// Computes the [Kullback-Leibler divergence] *Dₖₗ(p,q)* between two arrays,
/// where `self`=*p*.
///
/// The Kullback-Leibler divergence is defined as:
///
/// ```text
/// n
/// Dₖₗ(p,q) = - ∑ pᵢ ln(qᵢ/pᵢ)
/// i=1
/// ```
///
/// If the arrays are empty or their lengths are not equal, `None` is returned.
Member:

I agree that it's not good to combine these two cases. A few options:

  • Result<Option<A>, DimensionMismatch> would be fine.

  • I think Result<A, ShapeError> is a bit cleaner (where ShapeError is an enum with Empty and ShapeMismatch variants; see the sketch after this comment).

  • Another approach is Option<A>, returning None in the empty case and panicking if q cannot be broadcast to the shape of p. I generally avoid panicking, but I think this is the best option in this case because:

    • It's more consistent with ndarray's methods that operate on two arrays.
    • If the caller passes in arrays of mismatched shapes, it's clearly a bug in their code.
    • If the arrays have mismatched shapes, it's hard to see how the caller would recover from this error case.

    The question then is why we should return None in the empty case instead of panicking. I think this makes sense because it's much more likely than a shape mismatch, doesn't really indicate a bug in the caller, and is more likely to be recoverable.
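For illustration only, here is a sketch of how the second option could be shaped (a ShapeError enum with Empty and ShapeMismatch variants). The names and fields below are assumptions made for the sketch, not a final API:

```rust
/// Hypothetical error enum for the second option (illustrative names only).
#[derive(Debug)]
pub enum ShapeError {
    /// The input array(s) contained no elements.
    Empty,
    /// The two arrays had incompatible shapes.
    ShapeMismatch {
        first_shape: Vec<usize>,
        second_shape: Vec<usize>,
    },
}

// The trait method would then be fallible instead of returning Option:
//
//     fn kl_divergence(&self, q: &Self) -> Result<A, ShapeError>
//     where
//         A: Float;
```

A caller could then match on the variants and decide whether to log, substitute a default, or simply unwrap.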

Member Author:

I'd rather go for Result<A, ShapeError> - in the end, the caller can choose how to recover from errors (free to call unwrap or expect).
I can foresee various scenarios for both failure modes:

  • empty because you filtered some values out of another array/vec, but now nothing is left;
  • shape mismatch, because you read two different samples from files on disk which you expected to have the same number of values but... they don't (real life stories here 😅).

Member Author (@LukeMathWalker, Feb 25, 2019):

To provide a plausible use case: it might be impossible to recover (i.e. let the program resume its expected execution flow), but there might be actions one wants to perform before panicking, for example logging (the first that comes to my mind) or sending a request somewhere in a web application context.
One can always catch the panic using something like this, but it's less pleasant and not clear at first glance from the function signature.
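Presumably the mechanism referred to above is std::panic::catch_unwind (an assumption here, since the original link is not preserved). A minimal sketch of catching a panic at the call site:

```rust
use std::panic;

fn main() {
    // Catching the panic is possible, but the need for it is invisible in the
    // function signature, unlike an explicit Result return type.
    let outcome = panic::catch_unwind(|| {
        // Stand-in for a call that may panic, e.g. on mismatched shapes.
        panic!("shape mismatch");
    });
    if outcome.is_err() {
        // Last-ditch actions before shutting down: logging, sending a request, etc.
        eprintln!("computation panicked; logging before exiting");
    }
}
```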

Member:

Okay, that makes sense. We have to return an Option/Result anyway, so we might as well handle all the error cases without panicking. Along the same lines, should we return an error on negative values instead of panicking? I'd expect negative values to be a bigger issue than shape mismatches, and we're already checking if values are zero.

Member Author:

On that one I am torn, because the cost of checking for negative values scales with the number of elements given that it's an additional check. Do you think it's worth it?

Member Author:

Thinking over it again, Result<Option<A>, DimensionMismatch> is probably the best signature, given that most of our methods return an Option<A> with None when arrays are empty. Consistency is probably worth the double wrap (unless we want to use DimensionMismatch::Empty instead of Option in the rest of the API).
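For concreteness, the double-wrapped signature proposed here would presumably look like the following sketch (DimensionMismatch is the hypothetical error type from this discussion, not an existing item in the crate, and the trait name is illustrative):

```rust
use ndarray::{Data, Dimension};
use num_traits::Float;

/// Hypothetical error type carrying the two mismatched shapes.
#[derive(Debug)]
pub struct DimensionMismatch {
    pub first_shape: Vec<usize>,
    pub second_shape: Vec<usize>,
}

/// Sketch of the double-wrapped signature: Err on a shape mismatch,
/// Ok(None) for empty inputs, Ok(Some(value)) otherwise.
pub trait KlDivergenceSketch<A, S, D>
where
    S: Data<Elem = A>,
    D: Dimension,
{
    fn kl_divergence(&self, q: &Self) -> Result<Option<A>, DimensionMismatch>
    where
        A: Float;
}
```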

Member:

I didn't realize earlier that the ln of a negative number is NaN (except for the noisy float types, which panic). That behavior is fine with me.
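As a quick standalone check of that behaviour (a sketch assuming the noisy_float and num-traits crates already used elsewhere in this PR):

```rust
use noisy_float::types::n64;
use num_traits::Float;

fn main() {
    // For plain floats, the logarithm of a negative number quietly yields NaN...
    assert!((-1.0_f64).ln().is_nan());

    // ...while the noisy float types check their values and panic on NaN
    // (at least with debug assertions enabled), which is what the
    // #[should_panic] tests further down rely on.
    let _ = n64(-1.0).ln(); // panics here (with debug assertions enabled)
}
```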

Fwiw, I did a quick benchmark of the cost of adding a check for negative numbers, and it's pretty negligible (<2%) (the horizontal axis is arr.len(), and vertical axis is time to compute arr.entropy(); 'entropy2' is the checked implementation):

[benchmark plot: time to compute arr.entropy() vs. arr.len(), comparing the current implementation and the checked 'entropy2']

    fn entropy2(&self) -> Option<A>
        where
            A: Float
    {
        if self.len() == 0 {
            None
        } else {
            let mut negative = false;
            let entropy = self.mapv(
                |x| {
                    if x < A::zero() {
                        negative = true;
                        A::zero()
                    } else if x == A::zero() {
                        A::zero()
                    } else {
                        x * x.ln()
                    }
                }
            ).sum();
            if negative {
                None
            } else {
                Some(-entropy)
            }
        }
    }

(Note that I wouldn't actually return an Option in this case; I'd return a Result instead.)

Of course, if we check for negative numbers, someone might wonder why we don't check for numbers greater than 1 too.

I could go either way on the negative number checks. The explicit check is nice, but it adds complexity and we aren't checking other things such as values greater than 1 or the sum of values not being 1, so I guess I'd lean towards leaving off negative number checks for right now.

> Thinking over it again, Result<Option<A>, DimensionMismatch> is probably the best signature, given that most of our methods return an Option<A> with None when arrays are empty.

Okay, that's fine.

> unless we want to use DimensionMismatch::Empty instead of Option in the rest of the API

That wouldn't be too bad, although I don't really like returning an error enum when only one of the variants is possible.

///
/// **Panics** if any element in *q* is negative and taking the logarithm of a negative
/// number causes a panic for the element type `A` (as it does for the noisy float types).
///
/// ## Remarks
///
/// The Kullback-Leibler divergence is a measure used in [Information Theory]
/// to describe the relationship between two probability distributions: it only makes sense
/// when each array sums to 1 with entries between 0 and 1 (extremes included).
///
/// By definition, *pᵢ ln(qᵢ/pᵢ)* is set to 0 if *pᵢ* is 0.
///
/// [Kullback-Leibler divergence]: https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
/// [Information Theory]: https://en.wikipedia.org/wiki/Information_theory
fn kl_divergence(&self, q: &Self) -> Option<A>
where
A: Float;

/// Computes the [cross entropy] *H(p,q)* between two arrays,
/// where `self`=*p*.
///
/// The cross entropy is defined as:
///
/// ```text
/// n
/// H(p,q) = - ∑ pᵢ ln(qᵢ)
/// i=1
/// ```
///
/// If the arrays are empty or their lengths are not equal, `None` is returned.
///
/// **Panics** if any element in *q* is negative and taking the logarithm of a negative
/// number causes a panic for the element type `A` (as it does for the noisy float types).
///
/// ## Remarks
///
/// The cross entropy is a measure used in [Information Theory]
/// to describe the relationship between two probability distributions: it only makes sense
/// when each array sums to 1 with entries between 0 and 1 (extremes included).
///
/// The cross entropy is often used as an objective/loss function in
/// [optimization problems], including [machine learning].
///
/// By definition, *pᵢ ln(qᵢ)* is set to 0 if *pᵢ* is 0.
///
/// [cross entropy]: https://en.wikipedia.org/wiki/Cross-entropy
/// [Information Theory]: https://en.wikipedia.org/wiki/Information_theory
/// [optimization problems]: https://en.wikipedia.org/wiki/Cross-entropy_method
/// [machine learning]: https://en.wikipedia.org/wiki/Cross_entropy#Cross-entropy_error_function_and_logistic_regression
fn cross_entropy(&self, q: &Self) -> Option<A>
where
A: Float;
}


impl<A, S, D> EntropyExt<A, S, D> for ArrayBase<S, D>
where
S: Data<Elem = A>,
D: Dimension,
{
fn entropy(&self) -> Option<A>
where
A: Float
{
if self.len() == 0 {
None
} else {
let entropy = self.map(
|x| {
if *x == A::zero() {
A::zero()
} else {
*x * x.ln()
}
}
).sum();
Some(-entropy)
}
}

fn kl_divergence(&self, q: &Self) -> Option<A>
where
A: Float
{
if (self.len() == 0) | (self.len() != q.len()) {
None
} else {
let kl_divergence: A = self.iter().zip(q.iter()).map(
Member:

I'd prefer to use ndarray::Zip because it should have better performance.

Member Author (@LukeMathWalker, Feb 26, 2019):

I always have some problems when using Zip - can you help me understand what the compiler is complaining about?
[screenshot of the compiler error]

(.fold_while is more appropriate, but I was doing a first step just to get Zip there. I have pushed the code, even if broken, so that you can have a look)
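For reference, a hedged sketch of what the Zip + fold_while shape could look like for the inner sum of kl_divergence (written as a free function here so it is self-contained; not necessarily the code that was pushed):

```rust
use ndarray::{ArrayBase, Data, Dimension, FoldWhile, Zip};
use num_traits::Float;

// Illustrative free-function version of the inner computation.
fn kl_divergence_sum<A, S1, S2, D>(p: &ArrayBase<S1, D>, q: &ArrayBase<S2, D>) -> A
where
    A: Float,
    S1: Data<Elem = A>,
    S2: Data<Elem = A>,
    D: Dimension,
{
    let sum = Zip::from(p)
        .and(q)
        .fold_while(A::zero(), |acc, &p_i, &q_i| {
            if p_i == A::zero() {
                // By definition, pᵢ ln(qᵢ/pᵢ) is 0 when pᵢ is 0.
                FoldWhile::Continue(acc)
            } else {
                // FoldWhile::Done(...) could be used to bail out early,
                // e.g. on a negative value, if that check were added.
                FoldWhile::Continue(acc + p_i * (q_i / p_i).ln())
            }
        })
        .into_inner();
    -sum
}
```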

Member Author:

I got it to work by changing
let mut temp = ArrayBase::zeros(self.shape());
into
let mut temp = Array::zeros(self.raw_dim());
but I really want to understand the reason behind the extremely different outcome 😅

Member:

Yeah, the issue is that self.shape() returns &[usize], which is a slice of unspecified length, so when you call Array::zeros(anything.shape()), you get a dynamic-dimensional array. (There have been a few issues about this in the ndarray repository from confused users, and it would be nice to resolve this rough edge of the API at some point.) Zip requires that all arrays have the same dimensionality, so trying to zip together an array with dimension IxDyn and arrays with dimension D doesn't work (unless you force D = IxDyn).

The other issue is that calling ArrayBase::zeros() doesn't specify the storage of the array, so the compiler isn't sure what type of owned array to create (e.g. Array or ArcArray). Writing Array::zeros() tells the compiler that you specifically want an Array instance.
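A small sketch of the distinction being described (assuming a fixed two-dimensional f64 array):

```rust
use ndarray::{Array, Array2, ArrayD};

fn main() {
    let a: Array2<f64> = Array2::zeros((3, 4));

    // a.shape() returns &[usize], so the dimensionality is erased and
    // Array::zeros produces a dynamic-dimensional array (IxDyn).
    let dynamic: ArrayD<f64> = Array::zeros(a.shape());

    // a.raw_dim() keeps the static dimension type (Ix2 here), so the result
    // has the same dimensionality as `a` and can be used with Zip against it.
    let fixed: Array2<f64> = Array::zeros(a.raw_dim());

    assert_eq!(dynamic.ndim(), fixed.ndim());
}
```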

Member:

I've created rust-ndarray/ndarray#591 to help improve the docs in this area.

|(p, q)| {
if *p == A::zero() {
A::zero()
} else {
*p * (*q / *p).ln()
}
}
).collect::<Array1<A>>().sum();
Some(-kl_divergence)
}
}

fn cross_entropy(&self, q: &Self) -> Option<A>
where
A: Float
{
if (self.len() == 0) | (self.len() != q.len()) {
None
} else {
let cross_entropy: A = self.iter().zip(q.iter()).map(
|(p, q)| {
if *p == A::zero() {
A::zero()
} else {
*p * q.ln()
}
}
).collect::<Array1<A>>().sum();
Some(-cross_entropy)
}
}
}

#[cfg(test)]
mod tests {
use super::EntropyExt;
use std::f64;
use approx::assert_abs_diff_eq;
use noisy_float::types::n64;
use ndarray::{array, Array1};

#[test]
fn test_entropy_with_nan_values() {
let a = array![f64::NAN, 1.];
assert!(a.entropy().unwrap().is_nan());
}

#[test]
fn test_entropy_with_empty_array_of_floats() {
let a: Array1<f64> = array![];
assert!(a.entropy().is_none());
}

#[test]
fn test_entropy_with_array_of_floats() {
// Array of probability values - normalized and positive.
let a: Array1<f64> = array![
0.03602474, 0.01900344, 0.03510129, 0.03414964, 0.00525311,
0.03368976, 0.00065396, 0.02906146, 0.00063687, 0.01597306,
0.00787625, 0.00208243, 0.01450896, 0.01803418, 0.02055336,
0.03029759, 0.03323628, 0.01218822, 0.0001873 , 0.01734179,
0.03521668, 0.02564429, 0.02421992, 0.03540229, 0.03497635,
0.03582331, 0.026558 , 0.02460495, 0.02437716, 0.01212838,
0.00058464, 0.00335236, 0.02146745, 0.00930306, 0.01821588,
0.02381928, 0.02055073, 0.01483779, 0.02284741, 0.02251385,
0.00976694, 0.02864634, 0.00802828, 0.03464088, 0.03557152,
0.01398894, 0.01831756, 0.0227171 , 0.00736204, 0.01866295,
];
// Computed using scipy.stats.entropy
let expected_entropy = 3.721606155686918;

assert_abs_diff_eq!(a.entropy().unwrap(), expected_entropy, epsilon = 1e-6);
}

#[test]
fn test_cross_entropy_and_kl_with_nan_values() {
let a = array![f64::NAN, 1.];
let b = array![2., 1.];
assert!(a.cross_entropy(&b).unwrap().is_nan());
assert!(b.cross_entropy(&a).unwrap().is_nan());
assert!(a.kl_divergence(&b).unwrap().is_nan());
assert!(b.kl_divergence(&a).unwrap().is_nan());
}

#[test]
fn test_cross_entropy_and_kl_with_dimension_mismatch() {
let p = array![f64::NAN, 1.];
let q = array![2., 1., 5.];
assert!(q.cross_entropy(&p).is_none());
assert!(p.cross_entropy(&q).is_none());
assert!(q.kl_divergence(&p).is_none());
assert!(p.kl_divergence(&q).is_none());
}

#[test]
fn test_cross_entropy_and_kl_with_empty_array_of_floats() {
let p: Array1<f64> = array![];
let q: Array1<f64> = array![];
assert!(p.cross_entropy(&q).is_none());
assert!(p.kl_divergence(&q).is_none());
}

#[test]
fn test_cross_entropy_and_kl_with_negative_qs() {
let p = array![1.];
let q = array![-1.];
let cross_entropy: f64 = p.cross_entropy(&q).unwrap();
let kl_divergence: f64 = p.kl_divergence(&q).unwrap();
assert!(cross_entropy.is_nan());
assert!(kl_divergence.is_nan());
}

#[test]
#[should_panic]
fn test_cross_entropy_with_noisy_negative_qs() {
let p = array![n64(1.)];
let q = array![n64(-1.)];
p.cross_entropy(&q);
}

#[test]
#[should_panic]
fn test_kl_with_noisy_negative_qs() {
let p = array![n64(1.)];
let q = array![n64(-1.)];
p.kl_divergence(&q);
}

#[test]
fn test_cross_entropy_and_kl_with_zeroes_p() {
let p = array![0., 0.];
let q = array![0., 0.5];
assert_eq!(p.cross_entropy(&q).unwrap(), 0.);
assert_eq!(p.kl_divergence(&q).unwrap(), 0.);
}

#[test]
fn test_cross_entropy_and_kl_with_zeroes_q() {
let p = array![0.5, 0.5];
let q = array![0.5, 0.];
assert_eq!(p.cross_entropy(&q).unwrap(), f64::INFINITY);
assert_eq!(p.kl_divergence(&q).unwrap(), f64::INFINITY);
}

#[test]
fn test_cross_entropy() {
// Arrays of probability values - normalized and positive.
let p: Array1<f64> = array![
0.05340169, 0.02508511, 0.03460454, 0.00352313, 0.07837615, 0.05859495,
0.05782189, 0.0471258, 0.05594036, 0.01630048, 0.07085162, 0.05365855,
0.01959158, 0.05020174, 0.03801479, 0.00092234, 0.08515856, 0.00580683,
0.0156542, 0.0860375, 0.0724246, 0.00727477, 0.01004402, 0.01854399,
0.03504082,
];
let q: Array1<f64> = array![
0.06622616, 0.0478948, 0.03227816, 0.06460884, 0.05795974, 0.01377489,
0.05604812, 0.01202684, 0.01647579, 0.03392697, 0.01656126, 0.00867528,
0.0625685, 0.07381292, 0.05489067, 0.01385491, 0.03639174, 0.00511611,
0.05700415, 0.05183825, 0.06703064, 0.01813342, 0.0007763, 0.0735472,
0.05857833,
];
// Computed using scipy.stats.entropy(p) + scipy.stats.entropy(p, q)
let expected_cross_entropy = 3.385347705020779;

assert_abs_diff_eq!(p.cross_entropy(&q).unwrap(), expected_cross_entropy, epsilon = 1e-6);
}

#[test]
fn test_kl() {
// Arrays of probability values - normalized and positive.
let p: Array1<f64> = array![
0.00150472, 0.01388706, 0.03495376, 0.03264211, 0.03067355,
0.02183501, 0.00137516, 0.02213802, 0.02745017, 0.02163975,
0.0324602 , 0.03622766, 0.00782343, 0.00222498, 0.03028156,
0.02346124, 0.00071105, 0.00794496, 0.0127609 , 0.02899124,
0.01281487, 0.0230803 , 0.01531864, 0.00518158, 0.02233383,
0.0220279 , 0.03196097, 0.03710063, 0.01817856, 0.03524661,
0.02902393, 0.00853364, 0.01255615, 0.03556958, 0.00400151,
0.01335932, 0.01864965, 0.02371322, 0.02026543, 0.0035375 ,
0.01988341, 0.02621831, 0.03564644, 0.01389121, 0.03151622,
0.03195532, 0.00717521, 0.03547256, 0.00371394, 0.01108706,
];
let q: Array1<f64> = array![
0.02038386, 0.03143914, 0.02630206, 0.0171595 , 0.0067072 ,
0.00911324, 0.02635717, 0.01269113, 0.0302361 , 0.02243133,
0.01902902, 0.01297185, 0.02118908, 0.03309548, 0.01266687,
0.0184529 , 0.01830936, 0.03430437, 0.02898924, 0.02238251,
0.0139771 , 0.01879774, 0.02396583, 0.03019978, 0.01421278,
0.02078981, 0.03542451, 0.02887438, 0.01261783, 0.01014241,
0.03263407, 0.0095969 , 0.01923903, 0.0051315 , 0.00924686,
0.00148845, 0.00341391, 0.01480373, 0.01920798, 0.03519871,
0.03315135, 0.02099325, 0.03251755, 0.00337555, 0.03432165,
0.01763753, 0.02038337, 0.01923023, 0.01438769, 0.02082707,
];
// Computed using scipy.stats.entropy(p, q)
let expected_kl = 0.3555862567800096;

assert_abs_diff_eq!(p.kl_divergence(&q).unwrap(), expected_kl, epsilon = 1e-6);
}
}
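For orientation, a hedged usage sketch of the trait as it stands at this point in the diff (all three methods returning Option; the crate-root re-export appears in the src/lib.rs hunk below):

```rust
use ndarray::array;
use ndarray_stats::EntropyExt;

fn main() {
    // Two small, normalized probability distributions.
    let p = array![0.5, 0.5];
    let q = array![0.25, 0.75];

    // None is returned for empty arrays or, for the two-array methods,
    // for mismatched lengths; otherwise Some(value).
    let h = p.entropy().unwrap();
    let ce = p.cross_entropy(&q).unwrap();
    let kl = p.kl_divergence(&q).unwrap();

    // Cross entropy decomposes as H(p, q) = S(p) + Dkl(p, q), matching the
    // scipy-based expectation used in test_cross_entropy above.
    assert!((ce - (h + kl)).abs() < 1e-10);
}
```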
2 changes: 2 additions & 0 deletions src/lib.rs
@@ -43,10 +43,12 @@ pub use sort::Sort1dExt;
pub use correlation::CorrelationExt;
pub use histogram::HistogramExt;
pub use summary_statistics::SummaryStatisticsExt;
pub use entropy::EntropyExt;

mod maybe_nan;
mod quantile;
mod sort;
mod correlation;
mod entropy;
mod summary_statistics;
pub mod histogram;