
Descriptive Statistics
Whether calculations can be performed on distributions with open-ended classes depends on the specific calculation and the assumptions one is willing to make.
- Measures of Central Tendency:
- Mean: Calculating the mean is problematic with open-ended classes because the midpoint of the open interval is undefined. You would need to make an assumption about the value to assign to the open end, which introduces potential inaccuracies.
- Median: The median can often be determined even with an open-ended class, provided that the median falls within a closed interval. You need the cumulative frequency distribution to determine the interval containing the median.
- Mode: The mode can be identified if the modal class is a closed interval. If the open-ended class has the highest frequency, then the mode cannot be precisely determined.
- Measures of Dispersion:
- Range: The range cannot be determined when an extreme class is open-ended, because the minimum or maximum value (whichever lies in the open class) is unknown.
- Variance and Standard Deviation: These are generally difficult to calculate accurately with open-ended classes because they depend on knowing (or estimating) the values of individual data points or class midpoints. Estimating the open-ended interval's contribution to the variance can introduce significant error.
- Interquartile Range (IQR): The IQR can often be calculated, similar to the median, as long as the first and third quartiles fall within closed intervals.
In summary, while some calculations like the median and IQR *can* sometimes be performed with open-ended classes (provided the relevant quartiles fall within defined intervals), other calculations like the mean, standard deviation, and range are problematic and require assumptions or estimations that can impact accuracy. The feasibility depends heavily on the specific dataset and the desired level of precision.
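As a rough illustration of the points above, the following Python sketch estimates the median and quartiles from a grouped frequency table by linear interpolation within the class that contains them. The class boundaries and frequencies are hypothetical (they do not come from any dataset in this text), and `grouped_quantile` is an illustrative helper name. The sketch refuses to produce a value when the requested quantile lands in an open-ended class.

```python
# Hypothetical grouped data: (lower bound, upper bound, frequency).
# The last class is open-ended ("50 and above"), so its upper bound is None.
classes = [
    (0, 10, 5),
    (10, 20, 12),
    (20, 30, 18),
    (30, 40, 9),
    (40, 50, 4),
    (50, None, 2),
]


def grouped_quantile(classes, p):
    """Estimate the p-th quantile (0 < p < 1) by linear interpolation
    inside the class that contains it. Raises ValueError if that class
    is open-ended, because its width is unknown."""
    n = sum(freq for _, _, freq in classes)
    target = p * n          # position of the quantile in the ordered data
    cumulative = 0          # cumulative frequency below the current class
    for lower, upper, freq in classes:
        if freq > 0 and cumulative + freq >= target:
            if lower is None or upper is None:
                raise ValueError("quantile falls in an open-ended class")
            # L + ((p*n - CF) / f) * h
            return lower + (target - cumulative) / freq * (upper - lower)
        cumulative += freq
    raise ValueError("no data")


median = grouped_quantile(classes, 0.5)
q1 = grouped_quantile(classes, 0.25)
q3 = grouped_quantile(classes, 0.75)
iqr = q3 - q1

print(f"median ≈ {median:.2f}, Q1 ≈ {q1:.2f}, Q3 ≈ {q3:.2f}, IQR ≈ {iqr:.2f}")
```

With this made-up table the median and both quartiles fall in closed classes, so all three can be estimated. Asking for a quantile near the top (for example `grouped_quantile(classes, 0.99)`) raises an error, because it falls in the open class.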
The field of statistics involves several key elements that are fundamental to its application. Here's a breakdown of these elements:
- Data: This is the raw material of statistics. Data are collections of facts, figures, or other information, which can be numerical or non-numerical. Data can be collected through various methods such as surveys, experiments, or observations.
- Population: The entire group that is of interest in a study. It could be a group of people, objects, or events. Because studying an entire population is often impractical, statisticians often work with samples.
- Sample: A subset of the population that is selected for study. The sample should be representative of the population so that inferences made from the sample can be generalized to the entire population.
- Variable: A characteristic or attribute that can vary among individuals in a population or sample. Variables can be quantitative (numerical) or qualitative (categorical).
- Parameter: A numerical value that summarizes some aspect of the population. Because parameters are usually unknown, they are estimated from sample data.
- Statistic: A numerical value that summarizes some aspect of the sample. Statistics are used to estimate population parameters.
- Statistical Inference: The process of drawing conclusions about a population based on information obtained from a sample. This involves using statistical methods to estimate parameters, test hypotheses, and make predictions.
- Hypothesis Testing: A formal procedure for testing a claim or hypothesis about a population. It involves setting up a null hypothesis (a statement of no effect or no difference) and an alternative hypothesis (a statement that contradicts the null hypothesis), then using sample data to determine whether there is enough evidence to reject the null hypothesis.
- Probability: A measure of the likelihood that an event will occur. Probability plays a crucial role in statistical inference, as it allows statisticians to quantify the uncertainty associated with their conclusions.
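To make the relationship between population, sample, parameter, and statistic concrete, here is a minimal Python sketch. The population is simulated (10,000 values drawn from a normal distribution with arbitrary settings), so every number in it is an assumption made for illustration and not real data.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 simulated measurements.
population = [rng.gauss(50, 10) for _ in range(10_000)]
parameter = statistics.fmean(population)   # population mean (normally unknown)

# Simple random sample of 100 units, drawn without replacement.
sample = rng.sample(population, 100)
statistic = statistics.fmean(sample)       # sample mean, used to estimate the parameter

print(f"population mean (parameter): {parameter:.2f}")
print(f"sample mean (statistic):     {statistic:.2f}")
```

The two printed values will usually be close but not identical. That gap is the sampling error that statistical inference tries to quantify.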
Here's the breakdown of the statistical calculations for the dissolved oxygen levels in the Mumbai lakes:
Data: 5.4, 5.0, 8.1, 5.5, 6.5, 5.5
1. Mean (Average):
To calculate the mean, we sum all the values and divide by the number of values:
Mean = (5.4 + 5.0 + 8.1 + 5.5 + 6.5 + 5.5) / 6 = 36 / 6 = 6.0
2. Standard Deviation:
Standard deviation measures the spread of the data around the mean.
- Calculate the difference between each value and the mean.
- Square each of these differences.
- Sum the squared differences and divide by n - 1 (this is the sample variance; dividing by n instead would give the population variance).
- Take the square root of the variance to get the standard deviation.
Calculations:
- (5.4 - 6.0)² = 0.36
- (5.0 - 6.0)² = 1.00
- (8.1 - 6.0)² = 4.41
- (5.5 - 6.0)² = 0.25
- (6.5 - 6.0)² = 0.25
- (5.5 - 6.0)² = 0.25
Variance = (0.36 + 1.00 + 4.41 + 0.25 + 0.25 + 0.25) / (6 - 1) = 6.52 / 5 = 1.304
Standard Deviation = √1.304 ≈ 1.142
3. Variance:
Variance is the sum of the squared differences from the mean divided by n - 1 (the sample variance), as calculated in the standard deviation process.
Variance = 1.304
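The mean, variance, and standard deviation worked out by hand above can be checked with Python's standard `statistics` module. This is only a verification sketch. It assumes the six readings are treated as a sample, so the divisor is n - 1, and it shows the population variance (divisor n) purely for contrast.

```python
import statistics

# Dissolved oxygen readings (mg/L) from the text.
data = [5.4, 5.0, 8.1, 5.5, 6.5, 5.5]

mean = statistics.mean(data)
sample_variance = statistics.variance(data)        # divides by n - 1
sample_sd = statistics.stdev(data)                 # square root of the sample variance
population_variance = statistics.pvariance(data)   # divides by n, for contrast only

print(f"mean ≈ {mean:.3f}")
print(f"sample variance ≈ {sample_variance:.3f}")
print(f"sample standard deviation ≈ {sample_sd:.3f}")
print(f"population variance ≈ {population_variance:.3f}")
```

The population variance comes out smaller (about 1.087) because the same sum of squared differences, 6.52, is divided by 6 and not by 5.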
4. Coefficient of Variation (CV):
The coefficient of variation is a normalized measure of dispersion of a probability distribution or frequency distribution. It is often expressed as a percentage and is defined as the ratio of the standard deviation to the mean.
CV = (Standard Deviation / Mean) * 100
CV = (1.142 / 6.0) * 100 ≈ 19.03%
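A short Python sketch of the same ratio follows. The function name `coefficient_of_variation` is illustrative, and the guard against a zero mean is an added assumption, since the ratio is undefined in that case.

```python
import statistics

# Dissolved oxygen readings (mg/L) from the text.
data = [5.4, 5.0, 8.1, 5.5, 6.5, 5.5]


def coefficient_of_variation(values):
    """Return the sample standard deviation as a percentage of the mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / mean * 100


cv_percent = coefficient_of_variation(data)
print(f"CV ≈ {cv_percent:.2f}%")
```

Because the standard deviation and the mean share the same unit (mg/L), the result is unit-free, which is what makes the CV useful for comparing variability across datasets measured on different scales.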
5. Skewness:
Skewness measures the asymmetry of the data distribution. A skewness of 0 indicates a perfectly symmetrical distribution.
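Formula for sample skewness: (1/n) ∑[(xi - mean) / s]^3, where s is the sample standard deviation computed above. Other conventions normalize differently and give slightly different values.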
- (5.4 - 6.0) / 1.142 ≈ -0.525
- (5.0 - 6.0) / 1.142 ≈ -0.876
- (8.1 - 6.0) / 1.142 ≈ 1.839
- (5.5 - 6.0) / 1.142 ≈ -0.438
- (6.5 - 6.0) / 1.142 ≈ 0.438
- (5.5 - 6.0) / 1.142 ≈ -0.438
- (-0.525)^3 ≈ -0.145
- (-0.876)^3 ≈ -0.672
- (1.839)^3 ≈ 6.219
- (-0.438)^3 ≈ -0.084
- (0.438)^3 ≈ 0.084
- (-0.438)^3 ≈ -0.084
Sum of cubed values ≈ 5.318
Skewness = 5.318 / 6 ≈ 0.886
Since the value is positive, the data is positively skewed.
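The skewness calculation can be reproduced in a few lines of Python. The sketch follows the convention used in this worked example (standardize with the sample standard deviation, cube, then average over n). It is not the only definition in use, so library functions may report a somewhat different number.

```python
import statistics

# Dissolved oxygen readings (mg/L) from the text.
data = [5.4, 5.0, 8.1, 5.5, 6.5, 5.5]

n = len(data)
mean = statistics.mean(data)
s = statistics.stdev(data)  # sample standard deviation (divisor n - 1)

# Standardize each value, cube it, and average the cubes over n.
z_scores = [(x - mean) / s for x in data]
skewness = sum(z ** 3 for z in z_scores) / n

print(f"skewness ≈ {skewness:.3f}")
```

The single high reading of 8.1 contributes almost all of the positive sum, which is why the result is positive.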
6. Kurtosis:
Kurtosis measures the "tailedness" of the distribution. Higher kurtosis indicates more data concentrated in the tails (and thus fewer in the shoulders).
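Formula for sample excess kurtosis: (1/n) ∑[(xi - mean) / s]^4 - 3, where s is the sample standard deviation. Subtracting 3 makes a normal distribution score 0.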
- (5.4 - 6.0) / 1.142 ≈ -0.525
- (5.0 - 6.0) / 1.142 ≈ -0.876
- (8.1 - 6.0) / 1.142 ≈ 1.839
- (5.5 - 6.0) / 1.142 ≈ -0.438
- (6.5 - 6.0) / 1.142 ≈ 0.438
- (5.5 - 6.0) / 1.142 ≈ -0.438
- (-0.525)^4 ≈ 0.076
- (-0.876)^4 ≈ 0.589
- (1.839)^4 ≈ 11.437
- (-0.438)^4 ≈ 0.037
- (0.438)^4 ≈ 0.037
- (-0.438)^4 ≈ 0.037
Sum of the values ≈ 12.213
Kurtosis = 12.213 / 6 - 3 ≈ -0.965
Since the kurtosis is less than 0, the distribution is platykurtic (lighter tails than a normal distribution).
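A matching Python sketch for the kurtosis step follows. It uses the same convention as the hand calculation (sample standard deviation in the standardization, average of fourth powers over n, minus 3). As with skewness, other definitions exist and will give different values on a sample this small.

```python
import statistics

# Dissolved oxygen readings (mg/L) from the text.
data = [5.4, 5.0, 8.1, 5.5, 6.5, 5.5]

n = len(data)
mean = statistics.mean(data)
s = statistics.stdev(data)  # sample standard deviation (divisor n - 1)

# Standardize each value, raise it to the fourth power, average over n,
# then subtract 3 so that a normal distribution would score 0.
z_scores = [(x - mean) / s for x in data]
excess_kurtosis = sum(z ** 4 for z in z_scores) / n - 3

print(f"excess kurtosis ≈ {excess_kurtosis:.3f}")
```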
Summary:
- Mean: 6.0 mg/L
- Standard Deviation: ≈ 1.142 mg/L
- Variance: ≈ 1.304 (mg/L)²
- Coefficient of Variation: ≈ 19.03%
- Skewness: ≈ 0.886 (Positive Skew)
- Kurtosis: ≈ -0.965 (Platykurtic)