Classification of Data in Statistics
Rules of Classification
The principal objectives of classifying data are:
- To condense the mass of data in such a way that salient features can be readily noticed; for example, household incomes can be grouped as higher income group, middle-income group, and lower income group based on certain criteria.
- To facilitate comparison between attributes of variables; for example, comparison between education and income, income and expenditure on consumer durables, etc.
- To prepare data for tabulation.
- To highlight the significant features; for example, that the data is concentrated on one side, or that one particular value is dominant.
- To make the data easier to grasp.
- To study the relationships between variables.
Bases of Classification
Some common types of bases of classification are:
- Geographical classification: In this type, the data is classified according to area or region; for example, state-wise industrial production, city-wise consumer behavior, area-wise sales figures, etc.
- Chronological classification: In this type, the data is classified according to the time of its occurrence; for example, monthly sales, yearly production, daily demands, etc.
- Qualitative classification: In this type, the data is classified according to attributes that are not capable of measurement. It may be dichotomous or many-fold.
In dichotomous classification, an attribute is divided into two classes, one possessing the attribute and the other not possessing it; for example, male and female, smoker and nonsmoker, employed and unemployed, etc.
In many-fold classification, the attribute is divided so as to form several classes; for example, education level, religion, mother tongue, etc.
- Quantitative classification: In this type, the data is classified according to characteristics that can be measured; for example, salary, age, height, etc.
Quantitative data may be further classified into one of two types, discrete and continuous. In the discrete case, the values the variable can take are countable (the set of values may be infinite, as with the integers).
Examples are the number of accidents, the number of defectives, etc. In the continuous case, the variable can take any real value within a range; for example, weight, distance, volume, etc.
There are two kinds of frequency distributions, namely, discrete frequency distribution (or simple, or ungrouped frequency distribution), and continuous frequency distribution (or condensed or grouped frequency distribution).
Discrete Frequency Distribution
The process of preparing a discrete frequency distribution is simple. First, all possible values of the variable are arranged in ascending order in a column. Then, a second column of tally marks is prepared to count the number of times each value of the variable occurs.
To facilitate counting, the tally marks are grouped in blocks of five. The last column contains the frequency. To illustrate this, let us consider an example.
Example: Construct a frequency distribution table for the following data of the number of family members in 30 families:
Solution: The discrete frequency distribution with the help of a tally mark is shown below:
|Number of Family Members||Tally Marks||Frequency|
|Total N = 30|
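The tallying procedure can be sketched in Python. This is only a minimal illustration: the list `family_sizes` is hypothetical data made up for the sketch (it is not the data of the example above), and the helper names `discrete_frequency_table` and `tally` are assumptions of this sketch. A block of five is printed as four strokes followed by a slash standing in for the fifth, crossing stroke.

```python
from collections import Counter

# Hypothetical family sizes for 30 families (NOT the data of the example above).
family_sizes = [
    4, 3, 2, 5, 4, 6, 3, 4, 2, 5,
    4, 3, 4, 5, 6, 2, 3, 4, 4, 5,
    3, 4, 7, 4, 3, 5, 4, 2, 6, 4,
]

def discrete_frequency_table(values):
    """Return (value, frequency) pairs sorted by value in ascending order."""
    counts = Counter(values)
    return sorted(counts.items())

def tally(count):
    """Render a count as blocks of five, e.g. 7 -> '||||/ ||'."""
    blocks, rest = divmod(count, 5)
    return " ".join(["||||/"] * blocks + (["|" * rest] if rest else []))

table = discrete_frequency_table(family_sizes)
for value, freq in table:
    print(f"{value:>5}  {tally(freq):<20}  {freq}")
print("Total N =", sum(freq for _, freq in table))
```

The last line checks that the frequencies add up to the number of observations, which is a quick way to catch a miscount.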
Continuous Frequency Distribution
For continuous data a ‘grouped frequency distribution’ is necessary. For discrete data, the discrete frequency distribution is better than an array, but it does not condense the data. A ‘grouped frequency distribution’ is therefore also useful for condensing discrete data, by putting the values into a smaller number of groups or classes called class intervals.
Some important terms used in the case of the continuous frequency distribution are as follows:
- Class limits: Class limits denote the lowest and highest values that can be included in the class. The two boundaries of a class are known as the lower limit and upper limit of the class. For example, 10-19.5, 20-29.5, where 10 and 19.5 are the limits of the first class; 20 and 29.5 are the limits of the second class, etc.
- Class intervals: The class interval represents the width (span or size) of a class. The width may be determined by subtracting the lower limit of one class from the lower limit of the following class. For example, classes 10-20, 20-30, etc. have a class interval of 20 - 10 = 10.
- Class frequency: The number of observations falling within a particular class is called its class frequency. The total frequency indicates the total number of observations, N = Σf.
- Class mark or class mid-point: The mid-point of a class is defined as the sum of two successive lower limits divided by 2. Thus the class mark is the value lying halfway between the lower and upper class limits. For example, classes 10-20, 20-30, etc. have class marks 15, 25, etc.
- Types of class intervals: There are different ways in which the limits of class intervals can be shown.
- Exclusive method: The class intervals are so arranged that the upper limit of one class is the lower limit of the next class. This method always presumes that the upper limit is excluded from the class; for example, with class limits 20-25 and 25-30, an observation with value 25 is included in the class 25-30.
- Inclusive method: In this method, the upper limit of the class is included in that class itself. In such cases, there is no overlap between the upper limit of one class and the lower limit of the next class. For example, with class limits 20-29.5, 30-39.5, 40-49.5, etc. there is no ambiguity, but values lying between 29.5 and 30, or between 39.5 and 40, etc. cannot be accommodated.
- Open end: In an open-end distribution, the lower limit of the very first class and/or upper limit of the last class is not given.
For example, while stating the distribution of monthly salary of managers in rupees, one may specify class limits as, below 15000, 15000-25000, 25000-35000, 35000-45000, and above 45000.
Similarly, while recording weights of college students in kg as grouped data the class intervals could be less than 50, 50 to 60, 60 to 70, 70 to 80, 80 to 90, and greater than 90.
- Unequal class interval: In this method, the width is not the same for all classes. It is of practical use when there are large gaps in the data, or the distribution of the data is uneven.
Data with unequal class intervals can still be explained, visualized, and plotted. However, we must adjust the formulae used for calculations accordingly.
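The terms defined above can be illustrated with a short Python sketch. The function names `class_width`, `class_mark`, and `find_class_exclusive` are chosen for this illustration only; the class limits are the ones used in the examples above.

```python
def class_width(lower_limits):
    """Width: lower limit of the following class minus lower limit of this class."""
    return lower_limits[1] - lower_limits[0]

def class_mark(lower, next_lower):
    """Mid-point: the sum of two successive lower limits divided by 2."""
    return (lower + next_lower) / 2

def find_class_exclusive(value, classes):
    """Return the (lower, upper) class containing value; the upper limit is excluded."""
    for lower, upper in classes:
        if lower <= value < upper:
            return (lower, upper)
    return None  # the value falls outside every class

classes = [(20, 25), (25, 30)]
print(class_width([10, 20, 30]))          # 10
print(class_mark(10, 20))                 # 15.0
print(find_class_exclusive(25, classes))  # (25, 30)
```

Under the exclusive method the test `lower <= value < upper` is what places the value 25 in the class 25-30 rather than in 20-25.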
Guidelines for Choosing the Classes
- The number of classes should not be too small or too large, preferably between 5 and 15.
- If possible, the widths of the intervals should be numerically simple like 5, 10, 15, etc.
- It is desirable to have classes of equal width.
- The lower limit of the first class should preferably be 0, 5, 10, or a multiple thereof.
- The class interval should be determined based on the range (the difference between the maximum and minimum values) and the number of classes to be formed.
All the above points can be explained with the help of the following example.
Example: Ages of 50 employees are given:
Prepare a frequency distribution table.
Solution: A frequency distribution table is prepared as follows:
- First, find the highest and lowest values. These are 65 and 21 respectively. Thus, the difference is 44.
- Since there are 50 observations in total, we decide to use 5 classes.
- The approximate class interval works out to be (65-21)/5 = 8.8. Rounding up to a numerically simple value, we select the class interval as 10.
- As our lowest value is 21, we take 20 as the lower limit of the first class. We use the exclusive method of class interval.
- We then decide on class intervals as 20-30, 30-40, 40-50, 50-60, and 60-70.
- Then, each observation is checked for the class interval in which it lies. For each observation, we make a tally mark against the corresponding class interval. As per the convention, every fifth tally is put horizontally across. This helps with quick counting.
The frequency distribution is given below:
|Class Interval||Class Mark||Tally||Frequency|
|Total = 50|
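The steps of the solution can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the list `ages` is a small hypothetical sample (20 values chosen so that the lowest and highest are 21 and 65, as in the example; it is not the example's data), and `grouped_frequency` and its parameter `round_to` are names invented for the illustration. The width is rounded up to a multiple of `round_to`, and the first lower limit is rounded down to a multiple of `round_to`.

```python
import math

# Hypothetical ages (NOT the 50 ages of the example); lowest 21, highest 65.
ages = [21, 34, 45, 52, 38, 29, 41, 65, 33, 47,
        56, 30, 27, 44, 39, 50, 36, 48, 23, 61]

def grouped_frequency(values, num_classes, round_to=10):
    """Build exclusive-method classes of equal width and count observations."""
    lowest, highest = min(values), max(values)
    raw_width = (highest - lowest) / num_classes        # (65 - 21) / 5 = 8.8
    width = math.ceil(raw_width / round_to) * round_to  # rounded up to 10
    start = (lowest // round_to) * round_to             # 21 -> 20
    classes = []
    lower = start
    # Keep adding classes until the highest value is covered.
    while lower <= highest:
        classes.append((lower, lower + width))
        lower += width
    # Exclusive method: the upper limit belongs to the next class.
    freqs = [sum(1 for v in values if lo <= v < hi) for lo, hi in classes]
    return classes, freqs

classes, freqs = grouped_frequency(ages, num_classes=5)
for (lo, hi), f in zip(classes, freqs):
    print(f"{lo}-{hi}  class mark {(lo + hi) / 2}  frequency {f}")
print("Total =", sum(freqs))
```

Because the width and starting point are rounded, the loop may in general produce a different number of classes than requested; with these values it yields exactly the five classes 20-30 to 60-70.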
Cumulative and Relative Frequency
In many situations rather than listing the actual frequency opposite each class, it may be appropriate to list either cumulative frequencies or relative frequencies, or both.
The cumulative frequency of a given class interval represents the total of all the previous class frequencies, including the frequency of the class against which it is written.
The relative frequency of a class is its frequency divided by the total number of observations N. If we multiply the relative frequency by 100, we get the percentage frequency.
There are two important advantages to looking at relative frequencies (percentages) instead of absolute frequencies in a frequency distribution. These are:
- Relative frequencies facilitate the comparison of two or more sets of data.
- Relative frequencies constitute the basis of understanding the concept of probability.
To explain the cumulative and relative frequencies we work these on our earlier problem.
Example: For the ages of the 50 employees in the earlier example, find the cumulative frequency, relative frequency, and percentage frequency.
|Class Interval||Frequency (f)||Cumulative Frequency||Relative Frequency||Percentage Frequency|
|20-30||7||(0+7) = 7||7/50 = 0.14||14|
|30-40||16||(7+16) = 23||16/50 = 0.32||32|
|40-50||15||(23+15) = 38||15/50 = 0.30||30|
|50-60||9||(38+9) = 47||9/50 = 0.18||18|
|60-70||3||(47+3) = 50||3/50 = 0.06||6|
|N = ∑f = 50||Total = 1||Total = 100|
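The same columns can be computed with a few lines of Python. The sketch below takes the class frequencies from the table above; the variable names are chosen for this illustration only.

```python
from itertools import accumulate

classes = ["20-30", "30-40", "40-50", "50-60", "60-70"]
frequencies = [7, 16, 15, 9, 3]  # class frequencies from the table above

total = sum(frequencies)                            # N = 50
cumulative = list(accumulate(frequencies))          # running total of frequencies
relative = [f / total for f in frequencies]         # f / N
percentage = [100 * f / total for f in frequencies] # relative frequency x 100

for row in zip(classes, frequencies, cumulative, relative, percentage):
    print("{:<7} {:>3} {:>3} {:>5.2f} {:>5.1f}".format(*row))
```

The last cumulative frequency always equals N, and the relative frequencies add up to 1, which gives a simple check on the arithmetic.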