statistics: Mathematical statistics functions

New in version 3.4.

Source code: Lib/statistics.py


This module provides functions for calculating mathematical statistics of numeric (Real-valued) data.

The module is not intended to be a competitor to third-party libraries such as NumPy, SciPy, or proprietary full-featured statistics packages aimed at professional statisticians such as Minitab, SAS and Matlab. It is aimed at the level of graphing and scientific calculators.

Unless explicitly noted, these functions support int, float, Decimal and Fraction. Behaviour with other types (whether in the numeric tower or not) is currently unsupported. Collections with a mix of types are also undefined and implementation-dependent. If your input data consists of mixed types, you may be able to use map() to ensure a consistent result, for example: map(float, input_data).
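
As a minimal sketch of the map() suggestion above, the following converts a mixed-type input to float before averaging. The variable names are illustrative only.

```python
from decimal import Decimal
from fractions import Fraction
from statistics import mean

# Illustrative input mixing int, Fraction and Decimal values.
mixed = [1, Fraction(1, 2), Decimal("0.25")]

# Convert everything to float so the function sees a single type.
uniform = list(map(float, mixed))
result = mean(uniform)
print(result)
```

Converting to float trades exactness for consistency; if exact results matter, converting everything to Fraction or Decimal instead may be the better choice.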

Averages and measures of central location

These functions calculate an average or typical value from a population or sample.

mean()

Arithmetic mean (“average”) of data.

fmean()

Fast, floating point arithmetic mean.

geometric_mean()

Geometric mean of data.

harmonic_mean()

Harmonic mean of data.

median()

Median (middle value) of data.

median_low()

Low median of data.

median_high()

High median of data.

median_grouped()

Median, or 50th percentile, of grouped data.

mode()

Single mode (most common value) of discrete or nominal data.

multimode()

List of modes (most common values) of discrete or nominal data.

quantiles()

Divide data into intervals with equal probability.

Measures of spread

These functions calculate a measure of how much the population or sample tends to deviate from the typical or average values.

pstdev()

Population standard deviation of data.

pvariance()

Population variance of data.

stdev()

Sample standard deviation of data.

variance()

Sample variance of data.

Statistics for relations between two inputs

These functions calculate statistics regarding relations between two inputs.

covariance()

Sample covariance for two variables.

correlation()

Pearson’s correlation coefficient for two variables.

linear_regression()

Slope and intercept for simple linear regression.

Function details

Note: The functions do not require the data given to them to be sorted. However, for reading convenience, most of the examples show sorted sequences.

statistics.mean(data)

Return the sample arithmetic mean of data which can be a sequence or iterable.

The arithmetic mean is the sum of the data divided by the number of data points. It is commonly called “the average”, although it is only one of many different mathematical averages. It is a measure of the central location of the data.

If data is empty, StatisticsError will be raised.

Some examples of use:

>>> mean([1, 2, 3, 4, 4])
2.8
>>> mean([-1.0, 2.5, 3.25, 5.75])
2.625
>>> from fractions import Fraction as F
>>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
Fraction(13, 21)

>>> from decimal import Decimal as D
>>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
Decimal('0.5625')

Note

The mean is strongly affected by outliers and is not necessarily a typical example of the data points. For a more robust, although less efficient, measure of central tendency, see median().

The sample mean gives an unbiased estimate of the true population mean, so that when taken on average over all the possible samples, mean(sample) converges on the true mean of the entire population. If data represents the entire population rather than a sample, then mean(data) is equivalent to calculating the true population mean μ.
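
To illustrate the sensitivity to outliers mentioned in the note, here is a small sketch with made-up data in which a single extreme value moves the mean but leaves the median unchanged.

```python
from statistics import mean, median

typical = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 100]  # one extreme value replaces the 5

# How far each statistic moves when the outlier is introduced.
mean_shift = mean(with_outlier) - mean(typical)
median_shift = median(with_outlier) - median(typical)
print(mean_shift, median_shift)
```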

statistics.fmean(data)

Convert data to floats and compute the arithmetic mean.

This runs faster than the mean() function and it always returns a float. The data may be a sequence or iterable. If the input dataset is empty, raises a StatisticsError.

>>> fmean([3.5, 4.0, 5.25])
4.25

New in version 3.8.

statistics.geometric_mean(data)

Convert data to floats and compute the geometric mean.

The geometric mean indicates the central tendency or typical value of the data using the product of the values (as opposed to the arithmetic mean which uses their sum).

Raises a StatisticsError if the input dataset is empty, if it contains a zero, or if it contains a negative value. The data may be a sequence or iterable.

No special efforts are made to achieve exact results. (However, this may change in the future.)

>>> round(geometric_mean([54, 24, 36]), 1)
36.0
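
One common way to think about the geometric mean is as the exponential of the arithmetic mean of the logarithms. The sketch below checks that identity for the data above; it is an illustration of the mathematics, not a statement about how the library computes its result.

```python
from math import exp, log
from statistics import fmean, geometric_mean

data = [54, 24, 36]  # product is 46656, which is 36 cubed

# exp(mean(log(x))) equals the n-th root of the product of the values.
via_logs = exp(fmean([log(x) for x in data]))
print(via_logs, geometric_mean(data))
```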

New in version 3.8.

statistics.harmonic_mean(data, weights=None)

Return the harmonic mean of data, a sequence or iterable of real-valued numbers. If weights is omitted or None, then equal weighting is assumed.

The harmonic mean is the reciprocal of the arithmetic mean() of the reciprocals of the data. For example, the harmonic mean of three values a, b and c will be equivalent to 3/(1/a + 1/b + 1/c). If one of the values is zero, the result will be zero.

The harmonic mean is a type of average, a measure of the central location of the data. It is often appropriate when averaging ratios or rates, for example speeds.

Suppose a car travels 10 km at 40 km/hr, then another 10 km at 60 km/hr. What is the average speed?

>>> harmonic_mean([40, 60])
48.0

Suppose a car travels 40 km/hr for 5 km, and when traffic clears, speeds up to 60 km/hr for the remaining 30 km of the journey. What is the average speed?

>>> harmonic_mean([40, 60], weights=[5, 30])
56.0

StatisticsError is raised if data is empty, any element is less than zero, or if the weighted sum isn’t positive.

The current algorithm has an early-out when it encounters a zero in the input. This means that the subsequent inputs are not tested for validity. (This behavior may change in the future.)

New in version 3.6.

Changed in version 3.10: Added support for weights.

statistics.median(data)

Return the median (middle value) of numeric data, using the common “mean of middle two” method. If data is empty, StatisticsError is raised. data can be a sequence or iterable.

The median is a robust measure of central location and is less affected by the presence of outliers. When the number of data points is odd, the middle data point is returned:

>>> median([1, 3, 5])
3

When the number of data points is even, the median is interpolated by taking the average of the two middle values:

>>> median([1, 3, 5, 7])
4.0

This is suited for when your data is discrete, and you don’t mind that the median may not be an actual data point.

If the data is ordinal (supports order operations) but not numeric (doesn’t support addition), consider using median_low() or median_high() instead.
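
As a sketch of the ordinal case, letter grades can be sorted but not added, so median() cannot average the two middle values while median_low() and median_high() still work. The grade data here is made up for illustration.

```python
from statistics import median_high, median_low

# Strings sort alphabetically, so "A" < "B" < "C" < "D".
grades = ["C", "A", "D", "B"]

low = median_low(grades)    # smaller of the two middle values
high = median_high(grades)  # larger of the two middle values
print(low, high)
```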

statistics.median_low(data)

Return the low median of numeric data. If data is empty, StatisticsError is raised. data can be a sequence or iterable.

The low median is always a member of the data set. When the number of data points is odd, the middle value is returned. When it is even, the smaller of the two middle values is returned.

>>> median_low([1, 3, 5])
3
>>> median_low([1, 3, 5, 7])
3

Use the low median when your data are discrete and you prefer the median to be an actual data point rather than interpolated.

statistics.median_high(data)

Return the high median of data. If data is empty, StatisticsError is raised. data can be a sequence or iterable.

The high median is always a member of the data set. When the number of data points is odd, the middle value is returned. When it is even, the larger of the two middle values is returned.

>>> median_high([1, 3, 5])
3
>>> median_high([1, 3, 5, 7])
5

Use the high median when your data are discrete and you prefer the median to be an actual data point rather than interpolated.

statistics.median_grouped(data, interval=1)

Return the median of grouped continuous data, calculated as the 50th percentile, using interpolation. If data is empty, StatisticsError is raised. data can be a sequence or iterable.

>>> median_grouped([52, 52, 53, 54])
52.5

In the following example, the data are rounded, so that each value represents the midpoint of data classes, e.g. 1 is the midpoint of the class 0.5–1.5, 2 is the midpoint of 1.5–2.5, 3 is the midpoint of 2.5–3.5, etc. With the data given, the middle value falls somewhere in the class 3.5–4.5, and interpolation is used to estimate it:

>>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
3.7
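
The interpolation can be reproduced by hand with the textbook grouped-median formula, lower class limit + interval * (n/2 - count below the class) / (count in the class). The sketch below applies that formula to the same data; the helper names are illustrative, and it is not necessarily how the library computes its result.

```python
from statistics import median_grouped

data = sorted([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
interval = 1
n = len(data)

x = data[n // 2]            # midpoint of the class containing the median
lower = x - interval / 2    # lower limit of that class (3.5)
below = data.index(x)       # number of values below the class (4)
in_class = data.count(x)    # number of values inside the class (5)

estimate = lower + interval * (n / 2 - below) / in_class
print(estimate, median_grouped(data, interval))
```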

Optional argument interval represents the class interval, and defaults to 1. Changing the class interval naturally will change the interpolation:

>>> median_grouped([1, 3, 3, 5, 7], interval=1)
3.25
>>> median_grouped([1, 3, 3, 5, 7], interval=2)
3.5

This function does not check whether the data points are at least interval apart.

CPython implementation detail: Under some circumstances, median_grouped() may coerce data points to floats. This behaviour is likely to change in the future.

See also

  • “Statistics for the Behavioral Sciences”, Frederick J Gravetter and Larry B Wallnau (8th Edition).

  • The SSMEDIAN function in the Gnome Gnumeric spreadsheet, including this discussion.

statistics.mode(data)

Return the single most common data point from discrete or nominal data. The mode (when it exists) is the most typical value and serves as a measure of central location.

If there are multiple modes with the same frequency, returns the first one encountered in the data. If the smallest or largest of those is desired instead, use min(multimode(data)) or max(multimode(data)). If the input data is empty, StatisticsError is raised.

mode assumes discrete data and returns a single value. This is the standard treatment of the mode as commonly taught in schools:

>>> mode([1, 1, 2, 3, 3, 3, 3, 4])
3

The mode is unique in that it is the only statistic in this package that also applies to nominal (non-numeric) data:

>>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
'red'

Changed in version 3.8: Now handles multimodal datasets by returning the first mode encountered. Formerly, it raised StatisticsError when more than one mode was found.

statistics.multimode(data)

Return a list of the most frequently occurring values in the order they were first encountered in the data. Will return more than one result if there are multiple modes or an empty list if the data is empty:

>>> multimode('aabbbbccddddeeffffgg')
['b', 'd', 'f']
>>> multimode('')
[]
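
The relationship between mode() and multimode() on tied data can be sketched as follows, assuming Python 3.8 or later. The data is made up so that two values share the highest count.

```python
from statistics import mode, multimode

data = [3, 1, 3, 1, 2]        # 3 and 1 both occur twice

first = mode(data)            # the first mode encountered
all_modes = multimode(data)   # every mode, in order of first appearance
smallest = min(all_modes)
largest = max(all_modes)
print(first, all_modes, smallest, largest)
```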

New in version 3.8.

statistics.pstdev(data, mu=None)

Return the population standard deviation (the square root of the population variance). See pvariance() for arguments and other details.

>>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
0.986893273527251

statistics.pvariance(data, mu=None)

Return the population variance of data, a non-empty sequence or iterable of real-valued numbers. Variance, or second moment about the mean, is a measure of the variability (spread or dispersion) of data. A large variance indicates that the data is spread out; a small variance indicates it is clustered closely around the mean.

If the optional second argument mu is given, it is typically the mean of the data. It can also be used to compute the second moment around a point that is not the mean. If it is missing or None (the default), the arithmetic mean is automatically calculated.

Use this function to calculate the variance from the entire population. To estimate the variance from a sample, the variance() function is usually a better choice.

Raises StatisticsError if data is empty.

Examples:

>>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
>>> pvariance(data)
1.25

If you have already calculated the mean of your data, you can pass it as the optional second argument mu to avoid recalculation:

>>> mu = mean(data)
>>> pvariance(data, mu)
1.25

Decimals and Fractions are supported:

>>> from decimal import Decimal as D
>>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
Decimal('24.815')
>>> from fractions import Fraction as F
>>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
Fraction(13, 72)

Note

When called with the entire population, this gives the population variance σ². When called on a sample instead, this is the biased sample variance s², also known as variance with N degrees of freedom.

If you somehow know the true population mean μ, you may use this function to calculate the variance of a sample, giving the known population mean as the second argument. Provided the data points are a random sample of the population, the result will be an unbiased estimate of the population variance.

statistics.stdev(data, xbar=None)

Return the sample standard deviation (the square root of the sample variance). See variance() for arguments and other details.

>>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
1.0810874155219827

statistics.variance(data, xbar=None)

Return the sample variance of data, an iterable of at least two real-valued numbers. Variance, or second moment about the mean, is a measure of the variability (spread or dispersion) of data. A large variance indicates that the data is spread out; a small variance indicates it is clustered closely around the mean.

If the optional second argument xbar is given, it should be the mean of data. If it is missing or None (the default), the mean is automatically calculated.

Use this function when your data is a sample from a population. To calculate the variance from the entire population, see pvariance().

Raises StatisticsError if data has fewer than two values.

Examples:

>>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
>>> variance(data)
1.3720238095238095

If you have already calculated the mean of your data, you can pass it as the optional second argument xbar to avoid recalculation:

>>> m = mean(data)
>>> variance(data, m)
1.3720238095238095

This function does not attempt to verify that you have passed the actual mean as xbar. Using arbitrary values for xbar can lead to invalid or impossible results.

Decimal and Fraction values are supported:

>>> from decimal import Decimal as D
>>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
Decimal('31.01875')
>>> from fractions import Fraction as F
>>> variance([F(1, 6), F(1, 2), F(5, 3)])
Fraction(67, 108)

Note

This is the sample variance s² with Bessel’s correction, also known as variance with N-1 degrees of freedom. Provided that the data points are representative (e.g. independent and identically distributed), the result should be an unbiased estimate of the true population variance.

If you somehow know the actual population mean μ you should pass it to the pvariance() function as the mu parameter to get the variance of a sample.
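
The difference between the two variance functions comes down to the divisor applied to the sum of squared deviations. The following sketch recomputes both by hand for the sample data used above; the intermediate names are illustrative.

```python
from statistics import fmean, pvariance, variance

data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
n = len(data)
xbar = fmean(data)

# Sum of squared deviations from the mean.
ss = sum((x - xbar) ** 2 for x in data)

sample_var = ss / (n - 1)    # Bessel's correction: N-1 degrees of freedom
population_var = ss / n      # N degrees of freedom
print(sample_var, population_var)
```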

statistics.quantiles(data, *, n=4, method='exclusive')

Divide data into n continuous intervals with equal probability. Returns a list of n - 1 cut points separating the intervals.

Set n to 4 for quartiles (the default). Set n to 10 for deciles. Set n to 100 for percentiles, which gives the 99 cut points that separate data into 100 equal sized groups. Raises StatisticsError if n is not at least 1.

The data can be any iterable containing sample data. For meaningful results, the number of data points in data should be larger than n. Raises StatisticsError if there are not at least two data points.

The cut points are linearly interpolated from the two nearest data points. For example, if a cut point falls one-third of the distance between two sample values, 100 and 112, the cut-point will evaluate to 104.

The method for computing quantiles can be varied depending on whether the data includes or excludes the lowest and highest possible values from the population.

The default method is “exclusive” and is used for data sampled from a population that can have more extreme values than found in the samples. The portion of the population falling below the i-th of m sorted data points is computed as i / (m + 1). Given nine sample values, the method sorts them and assigns the following percentiles: 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%.

Setting the method to “inclusive” is used for describing population data or for samples that are known to include the most extreme values from the population. The minimum value in data is treated as the 0th percentile and the maximum value is treated as the 100th percentile. The portion of the population falling below the i-th of m sorted data points is computed as (i - 1) / (m - 1). Given 11 sample values, the method sorts them and assigns the following percentiles: 0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%.

# Decile cut points for empirically sampled data
>>> data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110,
...         100, 75, 105, 103, 109, 76, 119, 99, 91, 103, 129,
...         106, 101, 84, 111, 74, 87, 86, 103, 103, 106, 86,
...         111, 75, 87, 102, 121, 111, 88, 89, 101, 106, 95,
...         103, 107, 101, 81, 109, 104]
>>> [round(q, 1) for q in quantiles(data, n=10)]
[81.0, 86.2, 89.0, 99.4, 102.5, 103.6, 106.0, 109.8, 111.0]
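
The percentile assignments described for the two methods can be checked with small made-up datasets. In this sketch, nine evenly spaced values sit exactly on the decile cut points under the exclusive method, and the interior nine of eleven values do the same under the inclusive method.

```python
from statistics import quantiles

# Exclusive: the i-th of m=9 sorted points sits at i / (m + 1) = 10%, ..., 90%.
nine = [10, 20, 30, 40, 50, 60, 70, 80, 90]
exclusive = quantiles(nine, n=10)  # method='exclusive' is the default

# Inclusive: the i-th of m=11 points sits at (i - 1) / (m - 1) = 0%, ..., 100%.
eleven = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
inclusive = quantiles(eleven, n=10, method='inclusive')

# Linear interpolation one-third of the way from 100 to 112.
cut = 100 + (112 - 100) / 3
print(exclusive, inclusive, cut)
```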

New in version 3.8.

statistics.covariance(x, y, /)

Return the sample covariance of two inputs x and y. Covariance is a measure of the joint variability of two inputs.

Both inputs must be of the same length (no less than two), otherwise StatisticsError is raised.

Examples:

>>> x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> y = [1, 2, 3, 1, 2, 3, 1, 2, 3]
>>> covariance(x, y)
0.75
>>> z = [9, 8, 7, 6, 5, 4, 3, 2, 1]
>>> covariance(x, z)
-7.5
>>> covariance(z, x)
-7.5

New in version 3.10.

statistics.correlation(x, y, /)

Return Pearson’s correlation coefficient for two inputs. Pearson’s correlation coefficient r takes values between -1 and +1. It measures the strength and direction of the linear relationship, where +1 means a very strong positive linear relationship, -1 a very strong negative linear relationship, and 0 no linear relationship.

Both inputs must be of the same length (no less than two), and neither may be constant; otherwise StatisticsError is raised.

Examples:

>>> x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> y = [9, 8, 7, 6, 5, 4, 3, 2, 1]
>>> correlation(x, x)
1.0
>>> correlation(x, y)
-1.0

New in version 3.10.

statistics.linear_regression(x, y, /)

Return the slope and intercept of simple linear regression parameters estimated using ordinary least squares. Simple linear regression describes the relationship between an independent variable x and a dependent variable y in terms of this linear function:

y = slope * x + intercept + noise

where slope and intercept are the regression parameters that are estimated, and noise represents the variability of the data that was not explained by the linear regression (it is equal to the difference between predicted and actual values of the dependent variable).

Both inputs must be of the same length (no less than two), and the independent variable x cannot be constant; otherwise a StatisticsError is raised.

For example, we can use the release dates of the Monty Python films to predict the cumulative number of Monty Python films that would have been produced by 2019 assuming that they had kept the pace.

>>> year = [1971, 1975, 1979, 1982, 1983]
>>> films_total = [1, 2, 3, 4, 5]
>>> slope, intercept = linear_regression(year, films_total)
>>> round(slope * 2019 + intercept)
16

New in version 3.10.

Exceptions

A single exception is defined:

exception statistics.StatisticsError

Subclass of ValueError for statistics-related exceptions.

NormalDist objects

NormalDist is a tool for creating and manipulating normal distributions of a random variable. It is a class that treats the mean and standard deviation of data measurements as a single entity.

Normal distributions arise from the Central Limit Theorem and have a wide range of applications in statistics.

class statistics.NormalDist(mu=0.0, sigma=1.0)

Returns a new NormalDist object where mu represents the arithmetic mean and sigma represents the standard deviation.

If sigma is negative, raises StatisticsError.

mean

A read-only property for the arithmetic mean of a normal distribution.

median

A read-only property for the median of a normal distribution.

mode

A read-only property for the mode of a normal distribution.

stdev

A read-only property for the standard deviation of a normal distribution.

variance

A read-only property for the variance of a normal distribution. Equal to the square of the standard deviation.

classmethod from_samples(data)

Makes a normal distribution instance with mu and sigma parameters estimated from the data using fmean() and stdev().

The data can be any iterable and should consist of values that can be converted to type float. If data does not contain at least two elements, raises StatisticsError because it takes at least one point to estimate a central value and at least two points to estimate dispersion.

samples(n, *, seed=None)

Generates n random samples for a given mean and standard deviation. Returns a list of float values.

If seed is given, creates a new instance of the underlying random number generator. This is useful for creating reproducible results, even in a multi-threading context.

pdf(x)

Using a probability density function (pdf), compute the relative likelihood that a random variable X will be near the given value x. Mathematically, it is the limit of the ratio P(x <= X < x+dx) / dx as dx approaches zero.

The relative likelihood is computed as the probability of a sample occurring in a narrow range divided by the width of the range (hence the word “density”). Since the likelihood is relative to other points, its value can be greater than 1.0.

cdf(x)

Using a cumulative distribution function (cdf), compute the probability that a random variable X will be less than or equal to x. Mathematically, it is written P(X <= x).

inv_cdf(p)

Compute the inverse cumulative distribution function, also known as the quantile function or the percent-point function. Mathematically, it is written x : P(X <= x) = p.

Finds the value x of the random variable X such that the probability of the variable being less than or equal to that value equals the given probability p.

overlap(other)

Measures the agreement between two normal probability distributions. Returns a value between 0.0 and 1.0 giving the overlapping area for the two probability density functions.

quantiles(n=4)

Divide the normal distribution into n continuous intervals with equal probability. Returns a list of (n - 1) cut points separating the intervals.

Set n to 4 for quartiles (the default). Set n to 10 for deciles. Set n to 100 for percentiles, which gives the 99 cut points that separate the normal distribution into 100 equal sized groups.

zscore(x)

Compute the Standard Score describing x in terms of the number of standard deviations above or below the mean of the normal distribution: (x - mean) / stdev.

New in version 3.9.
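
The methods described above can be exercised together in a short sketch. The distributions and the score 1255 are arbitrary choices for illustration, and zscore() assumes Python 3.9 or later.

```python
from statistics import NormalDist

# A narrow distribution has a tall peak: a density may exceed 1.0.
narrow = NormalDist(mu=0.0, sigma=0.1)
peak = narrow.pdf(0.0)

sat = NormalDist(1060, 195)
p = sat.cdf(1255)         # probability of a value <= 1255
x = sat.inv_cdf(p)        # the inverse maps p back to roughly 1255
z = sat.zscore(1255)      # (1255 - 1060) / 195

# The same seed reproduces the same samples.
a = sat.samples(5, seed=42)
b = sat.samples(5, seed=42)

# Identical distributions overlap completely.
same_overlap = sat.overlap(sat)
print(peak, p, x, z, same_overlap)
```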

Instances of NormalDist support addition, subtraction, multiplication and division by a constant. These operations are used for translation and scaling. For example:

>>> temperature_february = NormalDist(5, 2.5)    # Celsius
>>> temperature_february * (9/5) + 32            # Fahrenheit
NormalDist(mu=41.0, sigma=4.5)

Dividing a constant by an instance of NormalDist is not supported because the result wouldn’t be normally distributed.

Since normal distributions arise from additive effects of independent variables, it is possible to add and subtract two independent normally distributed random variables represented as instances of NormalDist. For example:

>>> birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
>>> drug_effects = NormalDist(0.4, 0.15)
>>> combined = birth_weights + drug_effects
>>> round(combined.mean, 1)
3.1
>>> round(combined.stdev, 1)
0.5
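
When two independent normal variables are added or subtracted, the means add or subtract while the variances always add. This sketch uses two arbitrary distributions chosen so that the combined standard deviation comes out to a round number.

```python
from statistics import NormalDist

a = NormalDist(2, 3)   # variance 9
b = NormalDist(1, 4)   # variance 16

total = a + b          # mean 2 + 1, variance 9 + 16
difference = a - b     # mean 2 - 1, variance still 9 + 16
print(total, difference)
```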

New in version 3.8.

NormalDist Examples and Recipes

NormalDist readily solves classic probability problems.

For example, given historical data for SAT exams showing that scores are normally distributed with a mean of 1060 and a standard deviation of 195, determine the percentage of students with test scores between 1100 and 1200, after rounding to the nearest whole number:

>>> sat = NormalDist(1060, 195)
>>> fraction = sat.cdf(1200 + 0.5) - sat.cdf(1100 - 0.5)
>>> round(fraction * 100.0, 1)
18.4

Find the quartiles and deciles for the SAT scores:

>>> list(map(round, sat.quantiles()))
[928, 1060, 1192]
>>> list(map(round, sat.quantiles(n=10)))
[810, 896, 958, 1011, 1060, 1109, 1162, 1224, 1310]

To estimate the distribution for a model that isn’t easy to solve analytically, NormalDist can generate input samples for a Monte Carlo simulation:

>>> def model(x, y, z):
...     return (3*x + 7*x*y - 5*y) / (11 * z)
...
>>> n = 100_000
>>> X = NormalDist(10, 2.5).samples(n, seed=3652260728)
>>> Y = NormalDist(15, 1.75).samples(n, seed=4582495471)
>>> Z = NormalDist(50, 1.25).samples(n, seed=6582483453)
>>> quantiles(map(model, X, Y, Z))
[1.4591308524824727, 1.8035946855390597, 2.175091447274739]

Normal distributions can be used to approximate Binomial distributions when the sample size is large and when the probability of a successful trial is near 50%.

For example, an open source conference has 750 attendees and two rooms with a 500 person capacity. There is a talk about Python and another about Ruby. In previous conferences, 65% of the attendees preferred to listen to Python talks. Assuming the population preferences haven’t changed, what is the probability that the Python room will stay within its capacity limits?

>>> n = 750             # Sample size
>>> p = 0.65            # Preference for Python
>>> q = 1.0 - p         # Preference for Ruby
>>> k = 500             # Room capacity
>>> # Approximation using the cumulative normal distribution
>>> from math import sqrt
>>> round(NormalDist(mu=n*p, sigma=sqrt(n*p*q)).cdf(k + 0.5), 4)
0.8402

>>> # Solution using the cumulative binomial distribution
>>> from math import comb, fsum
>>> round(fsum(comb(n, r) * p**r * q**(n-r) for r in range(k+1)), 4)
0.8402

>>> # Approximation using a simulation
>>> from random import seed, choices
>>> seed(8675309)
>>> def trial():
...     return choices(('Python', 'Ruby'), (p, q), k=n).count('Python')
>>> mean(trial() <= k for i in range(10_000))
0.8398

Normal distributions commonly arise in machine learning problems.

Wikipedia has a nice example of a Naive Bayesian Classifier. The challenge is to predict a person’s gender from measurements of normally distributed features including height, weight, and foot size.

We’re given a training dataset with measurements for eight people. The measurements are assumed to be normally distributed, so we summarize the data with NormalDist:

>>> height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
>>> height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
>>> weight_male = NormalDist.from_samples([180, 190, 170, 165])
>>> weight_female = NormalDist.from_samples([100, 150, 130, 150])
>>> foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
>>> foot_size_female = NormalDist.from_samples([6, 8, 7, 9])

Next, we encounter a new person whose feature measurements are known but whose gender is unknown:

>>> ht = 6.0        # height
>>> wt = 130        # weight
>>> fs = 8          # foot size

Starting with a 50% prior probability of being male or female, we compute the posterior as the prior times the product of likelihoods for the feature measurements given the gender:

>>> prior_male = 0.5
>>> prior_female = 0.5
>>> posterior_male = (prior_male * height_male.pdf(ht) *
...                   weight_male.pdf(wt) * foot_size_male.pdf(fs))
>>> posterior_female = (prior_female * height_female.pdf(ht) *
...                     weight_female.pdf(wt) * foot_size_female.pdf(fs))

The final prediction goes to the largest posterior. This is known as the maximum a posteriori or MAP:

>>> 'male' if posterior_male > posterior_female else 'female'
'female'
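
The posteriors computed above are unnormalized. As a hedged extension that is not part of the original recipe, dividing each by their sum (the evidence) turns them into probabilities that add up to 1. The sketch repeats the training data so that it runs on its own; the names evidence and prob_female are illustrative.

```python
from statistics import NormalDist

# Training data repeated from the example above.
height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
weight_male = NormalDist.from_samples([180, 190, 170, 165])
weight_female = NormalDist.from_samples([100, 150, 130, 150])
foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
foot_size_female = NormalDist.from_samples([6, 8, 7, 9])

ht, wt, fs = 6.0, 130, 8   # height, weight, foot size of the new person

posterior_male = (0.5 * height_male.pdf(ht) *
                  weight_male.pdf(wt) * foot_size_male.pdf(fs))
posterior_female = (0.5 * height_female.pdf(ht) *
                    weight_female.pdf(wt) * foot_size_female.pdf(fs))

# Normalize by the evidence so the two posteriors sum to 1.
evidence = posterior_male + posterior_female
prob_male = posterior_male / evidence
prob_female = posterior_female / evidence
prediction = 'male' if prob_male > prob_female else 'female'
print(prediction, prob_female)
```

Normalizing does not change which class wins; it only makes the confidence of the prediction easier to read.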