author Charles Harris <charlesr.harris@gmail.com> 2018-03-16 18:14:34 (GMT)
committer GitHub <noreply@github.com> 2018-03-16 18:14:34 (GMT)
commit 1cda13d7d72d4f679d0829c83f3647462a8f726f
tree f557228ada57576b0287429732d1eecea6461c66
parent e19e52e0934690323d666704b850dcab1f8c66bf
parent 8a67fa9d64cda97172681c0537faf45500155559
Merge pull request #10755 from eric-wieser/reduce-histogram-docs
DOC: Move bin estimator documentation from `histogram` to `histogram_bin_edges`
-rw-r--r-- numpy/lib/histograms.py 184
1 files changed, 76 insertions, 108 deletions
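
For readers who want to see what the documentation move implies in practice, here is a minimal sketch that is not part of the commit. It assumes a NumPy release that exposes `np.histogram_bin_edges` (1.15 or later). The helper name `edges_match`, the sample data, and the seed are illustrative only. It checks that a string `bins` value yields the same edges from `np.histogram` and from `np.histogram_bin_edges`.

```python
import numpy as np


def edges_match(data, method):
    """Return True if both functions produce identical bin edges.

    `edges_match` is a hypothetical helper used only for this illustration.
    """
    _, hist_edges = np.histogram(data, bins=method)
    only_edges = np.histogram_bin_edges(data, bins=method)
    return np.array_equal(hist_edges, only_edges)


rng = np.random.RandomState(0)   # arbitrary seed, for reproducibility
data = rng.normal(size=500)      # arbitrary float sample

methods = ("auto", "fd", "doane", "scott", "rice", "sturges", "sqrt")
results = {m: edges_match(data, m) for m in methods}
print(results)
```

If the two functions share their edge computation, as the commit message suggests, every entry in `results` should be `True`.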
diff --git a/numpy/lib/histograms.py b/numpy/lib/histograms.py
index f151a60..aa067a4 100644
--- a/numpy/lib/histograms.py
+++ b/numpy/lib/histograms.py
@@ -349,7 +349,7 @@ def _search_sorted_inclusive(a, v):
def histogram_bin_edges(a, bins=10, range=None, weights=None):
- """
+ r"""
Function to calculate only the edges of the bins used by the `histogram` function.
Parameters
@@ -425,6 +425,76 @@ def histogram_bin_edges(a, bins=10, range=None, weights=None):
--------
histogram
+ Notes
+ -----
+ The methods to estimate the optimal number of bins are well founded
+ in literature, and are inspired by the choices R provides for
+ histogram visualisation. Note that having the number of bins
+ proportional to :math:`n^{1/3}` is asymptotically optimal, which is
+ why it appears in most estimators. These are simply plug-in methods
+ that give good starting points for number of bins. In the equations
+ below, :math:`h` is the binwidth and :math:`n_h` is the number of
+ bins. All estimators that compute bin counts are recast to bin width
+ using the `ptp` of the data. The final bin count is obtained from
+ ``np.round(np.ceil(range / h))``.
+
+ 'Auto' (maximum of the 'Sturges' and 'FD' estimators)
+ A compromise to get a good value. For small datasets the Sturges
+ value will usually be chosen, while larger datasets will usually
+ default to FD. Avoids the overly conservative behaviour of FD
+ and Sturges for small and large datasets respectively.
+ Switchover point is usually :math:`a.size \approx 1000`.
+
+ 'FD' (Freedman Diaconis Estimator)
+ .. math:: h = 2 \frac{IQR}{n^{1/3}}
+
+ The binwidth is proportional to the interquartile range (IQR)
+ and inversely proportional to cube root of a.size. Can be too
+ conservative for small datasets, but is quite good for large
+ datasets. The IQR is very robust to outliers.
+
+ 'Scott'
+ .. math:: h = \sigma \sqrt[3]{\frac{24 * \sqrt{\pi}}{n}}
+
+ The binwidth is proportional to the standard deviation of the
+ data and inversely proportional to cube root of ``x.size``. Can
+ be too conservative for small datasets, but is quite good for
+ large datasets. The standard deviation is not very robust to
+ outliers. Values are very similar to the Freedman-Diaconis
+ estimator in the absence of outliers.
+
+ 'Rice'
+ .. math:: n_h = 2n^{1/3}
+
+ The number of bins is only proportional to cube root of
+ ``a.size``. It tends to overestimate the number of bins and it
+ does not take into account data variability.
+
+ 'Sturges'
+ .. math:: n_h = \log _{2}n+1
+
+ The number of bins is the base 2 log of ``a.size``. This
+ estimator assumes normality of data and is too conservative for
+ larger, non-normal datasets. This is the default method in R's
+ ``hist`` method.
+
+ 'Doane'
+ .. math:: n_h = 1 + \log_{2}(n) +
+ \log_{2}(1 + \frac{|g_1|}{\sigma_{g_1}})
+
+ g_1 = mean[(\frac{x - \mu}{\sigma})^3]
+
+ \sigma_{g_1} = \sqrt{\frac{6(n - 2)}{(n + 1)(n + 3)}}
+
+ An improved version of Sturges' formula that produces better
+ estimates for non-normal datasets. This estimator attempts to
+ account for the skew of the data.
+
+ 'Sqrt'
+ .. math:: n_h = \sqrt n
+ The simplest and fastest estimator. Only takes into account the
+ data size.
+
Examples
--------
>>> arr = np.array([0, 0, 0, 1, 2, 3, 3, 4, 5])
@@ -489,44 +559,8 @@ def histogram(a, bins=10, range=None, normed=False, weights=None,
.. versionadded:: 1.11.0
- If `bins` is a string from the list below, `histogram` will use
- the method chosen to calculate the optimal bin width and
- consequently the number of bins (see `Notes` for more detail on
- the estimators) from the data that falls within the requested
- range. While the bin width will be optimal for the actual data
- in the range, the number of bins will be computed to fill the
- entire range, including the empty portions. For visualisation,
- using the 'auto' option is suggested. Weighted data is not
- supported for automated bin size selection.
-
- 'auto'
- Maximum of the 'sturges' and 'fd' estimators. Provides good
- all around performance.
-
- 'fd' (Freedman Diaconis Estimator)
- Robust (resilient to outliers) estimator that takes into
- account data variability and data size.
-
- 'doane'
- An improved version of Sturges' estimator that works better
- with non-normal datasets.
-
- 'scott'
- Less robust estimator that that takes into account data
- variability and data size.
-
- 'rice'
- Estimator does not take variability into account, only data
- size. Commonly overestimates number of bins required.
-
- 'sturges'
- R's default method, only accounts for data size. Only
- optimal for gaussian data and underestimates number of bins
- for large non-gaussian datasets.
-
- 'sqrt'
- Square root (of data size) estimator, used by Excel and
- other programs for its speed and simplicity.
+ If `bins` is a string, it defines the method used to calculate the
+ optimal bin width, as defined by `histogram_bin_edges`.
range : (float, float), optional
The lower and upper range of the bins. If not provided, range
@@ -537,6 +571,9 @@ def histogram(a, bins=10, range=None, normed=False, weights=None,
based on the actual data within `range`, the bin count will fill
the entire range including portions containing no data.
normed : bool, optional
+
+ .. deprecated:: 1.6.0
+
This keyword is deprecated in NumPy 1.6.0 due to confusing/buggy
behavior. It will be removed in NumPy 2.0.0. Use the ``density``
keyword instead. If ``False``, the result will contain the
@@ -585,75 +622,6 @@ def histogram(a, bins=10, range=None, normed=False, weights=None,
the second ``[2, 3)``. The last bin, however, is ``[3, 4]``, which
*includes* 4.
- .. versionadded:: 1.11.0
-
- The methods to estimate the optimal number of bins are well founded
- in literature, and are inspired by the choices R provides for
- histogram visualisation. Note that having the number of bins
- proportional to :math:`n^{1/3}` is asymptotically optimal, which is
- why it appears in most estimators. These are simply plug-in methods
- that give good starting points for number of bins. In the equations
- below, :math:`h` is the binwidth and :math:`n_h` is the number of
- bins. All estimators that compute bin counts are recast to bin width
- using the `ptp` of the data. The final bin count is obtained from
- ``np.round(np.ceil(range / h))``.
-
- 'Auto' (maximum of the 'Sturges' and 'FD' estimators)
- A compromise to get a good value. For small datasets the Sturges
- value will usually be chosen, while larger datasets will usually
- default to FD. Avoids the overly conservative behaviour of FD
- and Sturges for small and large datasets respectively.
- Switchover point is usually :math:`a.size \approx 1000`.
-
- 'FD' (Freedman Diaconis Estimator)
- .. math:: h = 2 \frac{IQR}{n^{1/3}}
-
- The binwidth is proportional to the interquartile range (IQR)
- and inversely proportional to cube root of a.size. Can be too
- conservative for small datasets, but is quite good for large
- datasets. The IQR is very robust to outliers.
-
- 'Scott'
- .. math:: h = \sigma \sqrt[3]{\frac{24 * \sqrt{\pi}}{n}}
-
- The binwidth is proportional to the standard deviation of the
- data and inversely proportional to cube root of ``x.size``. Can
- be too conservative for small datasets, but is quite good for
- large datasets. The standard deviation is not very robust to
- outliers. Values are very similar to the Freedman-Diaconis
- estimator in the absence of outliers.
-
- 'Rice'
- .. math:: n_h = 2n^{1/3}
-
- The number of bins is only proportional to cube root of
- ``a.size``. It tends to overestimate the number of bins and it
- does not take into account data variability.
-
- 'Sturges'
- .. math:: n_h = \log _{2}n+1
-
- The number of bins is the base 2 log of ``a.size``. This
- estimator assumes normality of data and is too conservative for
- larger, non-normal datasets. This is the default method in R's
- ``hist`` method.
-
- 'Doane'
- .. math:: n_h = 1 + \log_{2}(n) +
- \log_{2}(1 + \frac{|g_1|}{\sigma_{g_1}})
-
- g_1 = mean[(\frac{x - \mu}{\sigma})^3]
-
- \sigma_{g_1} = \sqrt{\frac{6(n - 2)}{(n + 1)(n + 3)}}
-
- An improved version of Sturges' formula that produces better
- estimates for non-normal datasets. This estimator attempts to
- account for the skew of the data.
-
- 'Sqrt'
- .. math:: n_h = \sqrt n
- The simplest and fastest estimator. Only takes into account the
- data size.
Examples
--------
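
The estimator formulas in the Notes text moved by this diff can be tried out with a short sketch. This is an illustration under stated assumptions, not NumPy's internal implementation. The helpers `bin_width` and `bin_count` are hypothetical names, only five of the estimators are covered, and the data is an arbitrary float sample. Following the Notes, count-based estimators are recast to a width using the peak-to-peak range, and the bin count is the ceiling of range over width.

```python
import numpy as np


def bin_width(x, method):
    """Bin width h for a few estimators, following the documented formulas."""
    x = np.asarray(x, dtype=float)
    n = x.size
    span = x.max() - x.min()  # peak-to-peak range of the data
    if method == "sqrt":
        return span / np.sqrt(n)                  # n_h = sqrt(n)
    if method == "sturges":
        return span / (np.log2(n) + 1.0)          # n_h = log2(n) + 1
    if method == "rice":
        return span / (2.0 * n ** (1.0 / 3.0))    # n_h = 2 n^(1/3)
    if method == "scott":
        # h = sigma * (24 sqrt(pi) / n)^(1/3)
        return np.std(x) * (24.0 * np.sqrt(np.pi) / n) ** (1.0 / 3.0)
    if method == "fd":
        # h = 2 IQR / n^(1/3)
        q75, q25 = np.percentile(x, [75, 25])
        return 2.0 * (q75 - q25) / n ** (1.0 / 3.0)
    raise ValueError("unsupported method: %r" % (method,))


def bin_count(x, method):
    """Number of bins needed to cover the data range at the estimated width."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return int(np.ceil(span / bin_width(x, method)))


rng = np.random.RandomState(0)   # arbitrary seed
sample = rng.normal(size=500)    # arbitrary float sample
counts = {m: bin_count(sample, m)
          for m in ("sqrt", "sturges", "rice", "scott", "fd")}
print(counts)
```

For 500 points the size-only estimators do not depend on the data values: 'sqrt' gives 23 bins, 'sturges' 10, and 'rice' 16. The 'scott' and 'fd' counts vary with the spread of the sample.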