author     Eric Wieser    2018-03-16 07:46:49 (GMT)
committer  Eric Wieser    2018-03-16 07:49:05 (GMT)
commit     7a3db4909498a40bdeec44c9626924e37c51ed12 (patch)
tree       ec575bc95d86dad9aa0c05a9f60d0a546606a90c
parent     f42e10405f2354e1776f89402ceae0ad0ab637bb (diff)
DOC: Move bin estimator documentation from histogram to histogram_bin_edges
-rw-r--r--  numpy/lib/histograms.py  182
1 file changed, 75 insertions(+), 107 deletions(-)
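For context on the behaviour this commit documents: `histogram` resolves a string `bins` argument through the same machinery as `histogram_bin_edges`, so the two functions agree on the edges. A minimal sketch (assuming NumPy >= 1.15, where `np.histogram_bin_edges` is public; the sample data is arbitrary):

```python
import numpy as np

# Arbitrary sample data (same array as the docstring's example).
arr = np.array([0, 0, 0, 1, 2, 3, 3, 4, 5], dtype=float)

# A string selects a bin estimator; 'auto' is the suggested default
# for visualisation.
edges = np.histogram_bin_edges(arr, bins='auto')

# histogram delegates string bins to the same estimator, so the edges
# it uses are identical.
counts, hist_edges = np.histogram(arr, bins='auto')

assert np.array_equal(edges, hist_edges)
assert len(edges) == len(counts) + 1
```

This is exactly the redirection the second hunk below introduces: `histogram`'s docstring now points at `histogram_bin_edges` instead of duplicating the estimator list.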
diff --git a/numpy/lib/histograms.py b/numpy/lib/histograms.py
index f151a60..7329246 100644
--- a/numpy/lib/histograms.py
+++ b/numpy/lib/histograms.py
@@ -425,6 +425,76 @@ def histogram_bin_edges(a, bins=10, range=None, weights=None):
     --------
     histogram
 
+    Notes
+    -----
+    The methods to estimate the optimal number of bins are well founded
+    in literature, and are inspired by the choices R provides for
+    histogram visualisation. Note that having the number of bins
+    proportional to :math:`n^{1/3}` is asymptotically optimal, which is
+    why it appears in most estimators. These are simply plug-in methods
+    that give good starting points for number of bins. In the equations
+    below, :math:`h` is the binwidth and :math:`n_h` is the number of
+    bins. All estimators that compute bin counts are recast to bin width
+    using the `ptp` of the data. The final bin count is obtained from
+    ``np.round(np.ceil(range / h))``.
+
+    'Auto' (maximum of the 'Sturges' and 'FD' estimators)
+        A compromise to get a good value. For small datasets the Sturges
+        value will usually be chosen, while larger datasets will usually
+        default to FD. Avoids the overly conservative behaviour of FD
+        and Sturges for small and large datasets respectively.
+        Switchover point is usually :math:`a.size \approx 1000`.
+
+    'FD' (Freedman Diaconis Estimator)
+        .. math:: h = 2 \frac{IQR}{n^{1/3}}
+
+        The binwidth is proportional to the interquartile range (IQR)
+        and inversely proportional to cube root of ``a.size``. Can be too
+        conservative for small datasets, but is quite good for large
+        datasets. The IQR is very robust to outliers.
+
+    'Scott'
+        .. math:: h = \sigma \sqrt[3]{\frac{24 * \sqrt{\pi}}{n}}
+
+        The binwidth is proportional to the standard deviation of the
+        data and inversely proportional to cube root of ``x.size``. Can
+        be too conservative for small datasets, but is quite good for
+        large datasets. The standard deviation is not very robust to
+        outliers. Values are very similar to the Freedman-Diaconis
+        estimator in the absence of outliers.
+
+    'Rice'
+        .. math:: n_h = 2n^{1/3}
+
+        The number of bins is only proportional to cube root of
+        ``a.size``. It tends to overestimate the number of bins and it
+        does not take into account data variability.
+
+    'Sturges'
+        .. math:: n_h = \log_{2}n + 1
+
+        The number of bins is the base 2 log of ``a.size``. This
+        estimator assumes normality of data and is too conservative for
+        larger, non-normal datasets. This is the default method in R's
+        ``hist`` method.
+
+    'Doane'
+        .. math:: n_h = 1 + \log_{2}(n) +
+                            \log_{2}(1 + \frac{|g_1|}{\sigma_{g_1}})
+
+            g_1 = mean[(\frac{x - \mu}{\sigma})^3]
+
+            \sigma_{g_1} = \sqrt{\frac{6(n - 2)}{(n + 1)(n + 3)}}
+
+        An improved version of Sturges' formula that produces better
+        estimates for non-normal datasets. This estimator attempts to
+        account for the skew of the data.
+
+    'Sqrt'
+        .. math:: n_h = \sqrt n
+
+        The simplest and fastest estimator. Only takes into account the
+        data size.
+
     Examples
     --------
     >>> arr = np.array([0, 0, 0, 1, 2, 3, 3, 4, 5])
@@ -489,44 +559,8 @@ def histogram(a, bins=10, range=None, normed=False, weights=None,
 
         .. versionadded:: 1.11.0
 
-        If `bins` is a string from the list below, `histogram` will use
-        the method chosen to calculate the optimal bin width and
-        consequently the number of bins (see `Notes` for more detail on
-        the estimators) from the data that falls within the requested
-        range. While the bin width will be optimal for the actual data
-        in the range, the number of bins will be computed to fill the
-        entire range, including the empty portions. For visualisation,
-        using the 'auto' option is suggested. Weighted data is not
-        supported for automated bin size selection.
-
-        'auto'
-            Maximum of the 'sturges' and 'fd' estimators. Provides good
-            all around performance.
-
-        'fd' (Freedman Diaconis Estimator)
-            Robust (resilient to outliers) estimator that takes into
-            account data variability and data size.
-
-        'doane'
-            An improved version of Sturges' estimator that works better
-            with non-normal datasets.
-
-        'scott'
-            Less robust estimator that that takes into account data
-            variability and data size.
-
-        'rice'
-            Estimator does not take variability into account, only data
-            size. Commonly overestimates number of bins required.
-
-        'sturges'
-            R's default method, only accounts for data size. Only
-            optimal for gaussian data and underestimates number of bins
-            for large non-gaussian datasets.
-
-        'sqrt'
-            Square root (of data size) estimator, used by Excel and
-            other programs for its speed and simplicity.
+        If `bins` is a string, it defines the method used to calculate the
+        optimal bin width, as defined by `histogram_bin_edges`.
 
     range : (float, float), optional
         The lower and upper range of the bins.  If not provided, range
@@ -537,6 +571,9 @@ def histogram(a, bins=10, range=None, normed=False, weights=None,
         based on the actual data within `range`, the bin count will fill
         the entire range including portions containing no data.
     normed : bool, optional
+
+        .. deprecated:: 1.6.0
+
         This keyword is deprecated in NumPy 1.6.0 due to confusing/buggy
         behavior. It will be removed in NumPy 2.0.0. Use the ``density``
         keyword instead. If ``False``, the result will contain the
@@ -585,75 +622,6 @@ def histogram(a, bins=10, range=None, normed=False, weights=None,
     the second ``[2, 3)``.  The last bin, however, is ``[3, 4]``, which
     *includes* 4.
 
-    .. versionadded:: 1.11.0
-
-    The methods to estimate the optimal number of bins are well founded
-    in literature, and are inspired by the choices R provides for
-    histogram visualisation. Note that having the number of bins
-    proportional to :math:`n^{1/3}` is asymptotically optimal, which is
-    why it appears in most estimators. These are simply plug-in methods
-    that give good starting points for number of bins. In the equations
-    below, :math:`h` is the binwidth and :math:`n_h` is the number of
-    bins. All estimators that compute bin counts are recast to bin width
-    using the `ptp` of the data. The final bin count is obtained from
-    ``np.round(np.ceil(range / h))``.
-
-    'Auto' (maximum of the 'Sturges' and 'FD' estimators)
-        A compromise to get a good value. For small datasets the Sturges
-        value will usually be chosen, while larger datasets will usually
-        default to FD. Avoids the overly conservative behaviour of FD
-        and Sturges for small and large datasets respectively.
-        Switchover point is usually :math:`a.size \approx 1000`.
-
-    'FD' (Freedman Diaconis Estimator)
-        .. math:: h = 2 \frac{IQR}{n^{1/3}}
-
-        The binwidth is proportional to the interquartile range (IQR)
-        and inversely proportional to cube root of ``a.size``. Can be too
-        conservative for small datasets, but is quite good for large
-        datasets. The IQR is very robust to outliers.
-
-    'Scott'
-        .. math:: h = \sigma \sqrt[3]{\frac{24 * \sqrt{\pi}}{n}}
-
-        The binwidth is proportional to the standard deviation of the
-        data and inversely proportional to cube root of ``x.size``. Can
-        be too conservative for small datasets, but is quite good for
-        large datasets. The standard deviation is not very robust to
-        outliers. Values are very similar to the Freedman-Diaconis
-        estimator in the absence of outliers.
-
-    'Rice'
-        .. math:: n_h = 2n^{1/3}
-
-        The number of bins is only proportional to cube root of
-        ``a.size``. It tends to overestimate the number of bins and it
-        does not take into account data variability.
-
-    'Sturges'
-        .. math:: n_h = \log_{2}n + 1
-
-        The number of bins is the base 2 log of ``a.size``. This
-        estimator assumes normality of data and is too conservative for
-        larger, non-normal datasets. This is the default method in R's
-        ``hist`` method.
-
-    'Doane'
-        .. math:: n_h = 1 + \log_{2}(n) +
-                            \log_{2}(1 + \frac{|g_1|}{\sigma_{g_1}})
-
-            g_1 = mean[(\frac{x - \mu}{\sigma})^3]
-
-            \sigma_{g_1} = \sqrt{\frac{6(n - 2)}{(n + 1)(n + 3)}}
-
-        An improved version of Sturges' formula that produces better
-        estimates for non-normal datasets. This estimator attempts to
-        account for the skew of the data.
-
-    'Sqrt'
-        .. math:: n_h = \sqrt n
-
-        The simplest and fastest estimator. Only takes into account the
-        data size.
-
     Examples
     --------
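The estimator formulas in the moved Notes section can be checked by hand against what the library returns. The sketch below follows the docstring's recipe (estimator width `h`, then bin count from ``np.ceil(range / h)``); `rng`, `x`, and the intermediate variable names are illustrative, and a recent NumPy is assumed:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=1000)   # n large enough that FD usually wins over Sturges
n = x.size
span = np.ptp(x)            # the "range" in the docstring's recipe

# Sturges: n_h = log2(n) + 1, recast to a bin width via the data's ptp.
h_sturges = span / (np.log2(n) + 1.0)

# Freedman-Diaconis: h = 2 * IQR / n**(1/3).
q75, q25 = np.percentile(x, [75, 25])
h_fd = 2.0 * (q75 - q25) * n ** (-1.0 / 3.0)

# 'auto' keeps whichever width yields more bins (the smaller h), which is
# what "maximum of the 'Sturges' and 'FD' estimators" means in bin counts.
h_auto = min(h_sturges, h_fd)

# Final bin count, per the docstring: np.round(np.ceil(range / h)).
n_bins = int(np.round(np.ceil(span / h_auto)))

edges = np.histogram_bin_edges(x, bins='auto')
assert len(edges) - 1 == n_bins
```

The switchover behaviour described under 'Auto' can be seen by shrinking `size`: for small samples `h_sturges` becomes the smaller width and takes over.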