## Investigations and Sampling

The use of statistics involves the gathering of data or information.
This data is then processed, organised and displayed.
Finally, the data is presented and discussed, and conclusions are drawn.

Many people require statistics for a wide range of purposes.

• A school is keen to know whether boys or girls do best in certain subjects.
• Doctors may need to know what causes a particular illness or disease.
• A conservationist wants to find out how many different types of whales there are in a certain location.

You may be required to carry out a statistical investigation as part of your course.

Such an investigation would cover the following:

### Designing a Statistical Survey

The first task is to plan the investigation. This will require thinking about lots of issues:

• What are the aims of your investigation?
• What questions is your investigation trying to answer?
• How are you going to display the results?
• Who are you surveying?
• How will you collect the data?
• Will your sample be random?
• How big will your samples be?
• Who will be using your final report?
• Will your raw data be easy to process, display and analyse?

All of these questions have to be answered before starting an investigation.

The collection of data takes time and is therefore expensive.

A balance must be found between having a sample large enough to be representative and the excessive cost of taking a large sample.

### Target Population

A population in statistics is the whole group being considered in an investigation or survey.

Examples:

• The secondary schools in a city
• The males in a town
• The cans of Coke produced by a factory

Note that in statistics the term population does not necessarily mean the entire population of a town or country. It refers to the overall group being studied. The group can be small, such as the students at a school, or large, such as all of the people in a state or province. The group being investigated is called the target population.

Once the target population has been decided, the next task is to make sure that the results of the survey are as representative of this population as possible.

The sampling frame is a list of every item, person or object in the target population and is the means of obtaining the sample from the population. For example, when taking a sample of students from a school, the sampling frame would be a list of the students' names.

### Collecting the data

Two ways that data can be collected are by a questionnaire or an interview.

A questionnaire is a form with questions designed to obtain information. Careful preparation of questionnaires is essential and may require special training. Questions which are hard to understand, are ambiguous or lead the respondent to give a particular answer must be avoided. Questionnaires that are handed or posted to people often have low return rates.

An interview could be carried out by stopping people in the street or ringing people to ask the questions. Problems with this form of data collection include people's reluctance to give up time to answer questions and the difficulty of obtaining random samples.

If questionnaires and interviews are not carefully designed and administered, they can often produce biased samples.

A census is an investigation in which every member of the target population is surveyed. Censuses are very expensive, and it takes a long time to plan and organise them and to analyse the results.

The Australian and New Zealand censuses, which collect information about the entire population of each country, are held every five years, and thousands of people are employed to collect and analyse the data. Both the New Zealand and Australian censuses were last held in 2001. Governments use a census to help plan for the future.

### Bias

Bias occurs when the sample chosen is not representative of the population. If not every member of the population has an equal chance of being selected, the sample will be biased.

Sources of bias include:

• A sampling frame which does not include all of the population, e.g. a telephone directory when sampling all of a town's population.
• Poorly worded, ambiguous or leading questions in an interview or questionnaire, e.g. a survey asking 14-year-olds when they last had an alcoholic drink may not produce truthful responses.
• Bias by the person conducting the survey, e.g. an interviewer selecting people in the street may avoid certain types of people or may have an attitude or manner which influences the responses.
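The effect of an incomplete sampling frame can be illustrated with a small simulation. This is only a sketch: the town, the numbers of listed and unlisted people and their opinions are all invented for illustration, and the fixed seed is simply there so the run can be repeated.

```python
import random

# Hypothetical town of 1000 people (every figure here is invented).
# Each person is recorded as (listed_in_directory, supports_proposal).
town = (
    [(True, True)] * 360      # listed, supports the proposal
    + [(True, False)] * 240   # listed, does not support
    + [(False, True)] * 120   # not listed, supports
    + [(False, False)] * 280  # not listed, does not support
)

def support_rate(people):
    """Fraction of the given people who support the proposal."""
    return sum(1 for _, supports in people if supports) / len(people)

# The incomplete sampling frame: only people listed in the directory.
directory = [person for person in town if person[0]]

rng = random.Random(3)  # fixed seed, an assumption for repeatability
sample = rng.sample(directory, 100)

print("Whole town:    ", support_rate(town))
print("Directory only:", support_rate(directory))
print("Sample of 100: ", support_rate(sample))
```

In this made-up town the directory over-represents supporters, so even a perfectly random sample drawn from the directory is likely to give a result well above the true figure for the whole town.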

### Sampling

A sample is obtained when only part of the population is surveyed. Most statistical investigations do not set out to obtain data about every item in a population, as in a census, but rely on a sample from the population.

Examples:

• It is not possible to ask every person in an electorate how they intend to vote before an election.
• A wine company could not taste every single bottle of wine!

Taking a sample is usually much cheaper, quicker and more convenient than a census.

#### Random Samples

The aim when taking a sample is to obtain information which is representative of the whole population. If it is not, then the sample could be biased.

#### Choosing a sample

• The first thing to decide is the size of the sample. This would need to be large enough to be truly representative but not too large as this would be too expensive and time-consuming.
• A sample should be evenly spread over the population.
• A sample should be as random as possible. In a random sample every member of the target population has an equal chance of being chosen. A calculator or a spreadsheet can produce random numbers and there are tables of random numbers.

There are several ways to obtain a random sample:

• Draw names out of a hat or balls out of a barrel, like a Lotto draw.
• Give every person or item a number and choose the numbers at random, using special tables, a computer or a spreadsheet.

Example: In a school of 900 students, a sample of 20 has to be chosen. Allocate a number from 1 to 900 to each student from a list of students.

Random number tables: These contain strings of random digits. Start anywhere and read off the digits in groups of three. If the number is 000 or above 900, discard it.

Calculators: Most calculators have a RND# button. When pressed, it gives a three-digit decimal number such as 0.439. Multiplying these numbers by 1000 produces whole numbers from 0 to 999. Again, discard 0 and any number over 900.

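Spreadsheets: Some spreadsheets provide a function such as RAND( ), where a number is placed in the brackets and random numbers from 1 up to that number are given, i.e. RAND(900) produces random numbers between 1 and 900. In other spreadsheets, such as Microsoft Excel, RAND() takes no number and returns a decimal between 0 and 1; there, RANDBETWEEN(1, 900) produces random whole numbers from 1 to 900.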
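The school example can also be tried on a computer. The sketch below uses Python's standard `random` module, first imitating the "read three digits and discard" method and then doing the same job in one step. The function name `choose_by_discarding` and the fixed seeds are illustrative choices, and discarding repeated numbers is an added assumption so that no student is chosen twice.

```python
import random

def choose_by_discarding(population_size, sample_size, rng):
    """Imitate a random number table: take three-digit numbers and
    discard 000, anything above population_size, and any repeats."""
    chosen = []
    while len(chosen) < sample_size:
        number = rng.randint(0, 999)  # one three-digit group, 000 to 999
        if 1 <= number <= population_size and number not in chosen:
            chosen.append(number)
    return chosen

rng = random.Random(1)  # fixed seed so the run can be repeated
table_style = choose_by_discarding(900, 20, rng)

# The same job in one step: 20 different numbers from 1 to 900.
direct = rng.sample(range(1, 901), 20)

print(sorted(table_style))
print(sorted(direct))
```

Each number printed identifies one student on the numbered list. Both approaches give every student the same chance of being picked.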

#### Systematic Samples

Choosing names at regular intervals, say every 100th name, from an alphabetical list such as a telephone book gives a sample that is quite representative of the people living in a certain area. It is easy to carry out, but it is not truly random because people who do not own phones cannot be chosen.

If the population being considered was the students of a school, then a list of all of the students in the school could be used to produce a systematic random sample by selecting students at regular intervals from the list.
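A minimal sketch of systematic sampling from a school list is shown below. It assumes a school of 900 students and a sample of 20, as in the earlier example. The student names are placeholders, and choosing the starting point at random is a common refinement rather than something the method requires.

```python
import random

def systematic_sample(items, sample_size, rng):
    """Pick a random starting point, then take items at regular intervals."""
    interval = len(items) // sample_size  # gap between chosen items
    start = rng.randrange(interval)       # random start within the first interval
    return [items[start + i * interval] for i in range(sample_size)]

# Placeholder list standing in for the school roll.
students = [f"Student {n}" for n in range(1, 901)]

rng = random.Random(2)  # fixed seed so the run can be repeated
sample = systematic_sample(students, 20, rng)
print(sample)
```

With 900 students and a sample of 20 the interval is 45, so every 45th student on the list is chosen.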

#### Stratified and Cluster Sampling

Sometimes, it is desirable to ensure that there is sufficient representation in a sample from different groups within a population.

Example: In a school there may be a need to get representation across different form levels. The number of students chosen from each level should be proportional to the total number of students at that level.

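e.g. In a school of 1000 students there are 220 Year 9 students. In a stratified sample of 50 for the whole school, how many Year 9 students should there be? Year 9 makes up 220/1000 = 22% of the school, so the sample should contain 0.22 × 50 = 11 Year 9 students.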
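The proportional calculation can be checked, and extended to every form level, with a short script. Only the Year 9 figure (220 out of 1000) and the sample size of 50 come from the example; the sizes of the other year levels are invented so that the totals add up.

```python
# Year 9 = 220 comes from the example; the other sizes are hypothetical.
year_sizes = {
    "Year 9": 220,
    "Year 10": 200,
    "Year 11": 200,
    "Year 12": 200,
    "Year 13": 180,
}
sample_size = 50
school_total = sum(year_sizes.values())  # 1000 students

# Each level's share of the sample matches its share of the school.
allocation = {
    level: round(count * sample_size / school_total)
    for level, count in year_sizes.items()
}

for level, number in allocation.items():
    print(level, number)
```

These figures divide evenly. With real rolls the rounded numbers may not add up to exactly the intended sample size, and one or two would then need adjusting by hand.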

Example: When a newspaper carries out an opinion poll, it often asks people from the major towns and cities of a country. In New Zealand it would use the four biggest cities: Auckland, Hamilton, Wellington and Christchurch. In Australia it might use the capital cities of each of the states. This is an example of cluster sampling. It is obviously not very random, as it does not consider the people living in smaller towns and cities or those living in rural areas. The number sampled in each city would be proportional to the population of that city and worked out in a manner similar to the example above.

Random methods of selecting the sample should still be used within each stratum or cluster.
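One way to combine the two ideas, proportional numbers from each group and random choice within each group, is sketched below. The function name `stratified_sample`, the year-level sizes (apart from Year 9 = 220) and the student labels are all illustrative assumptions.

```python
import random

def stratified_sample(strata, sample_size, rng):
    """Choose a proportional number of members at random from each stratum."""
    total = sum(len(members) for members in strata.values())
    chosen = {}
    for name, members in strata.items():
        share = round(len(members) * sample_size / total)
        chosen[name] = rng.sample(members, share)  # random within the stratum
    return chosen

# Hypothetical school roll, grouped by year level.
sizes = {"Year 9": 220, "Year 10": 200, "Year 11": 200,
         "Year 12": 200, "Year 13": 180}
strata = {
    level: [f"{level} student {n}" for n in range(1, size + 1)]
    for level, size in sizes.items()
}

rng = random.Random(4)  # fixed seed so the run can be repeated
result = stratified_sample(strata, 50, rng)

for level, students in result.items():
    print(level, len(students), students[:3])
```

The same function would work for clusters such as cities, with each city's list of residents taking the place of a year level.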

### Organising and analysing the data

When the data has been collected, it is often summarised in tables and graphs.

Data can also be sorted using a stem and leaf diagram or using tallies in a frequency distribution and then displayed in a histogram or cumulative frequency graph. Statistical calculations can then be carried out to find information such as the mean, median and quartiles and these can then be shown on a box and whisker diagram. Calculators and computers can be used for calculating more complex statistics such as standard deviation.
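As a rough illustration of these calculations, the sketch below uses Python's standard `statistics` module on a small set of made-up test marks. The marks are invented, and the quartiles follow the module's default method, which may differ slightly from the method taught in a particular course.

```python
import statistics
from collections import Counter

# Hypothetical test marks (invented for illustration).
marks = [12, 15, 15, 17, 18, 20, 21, 23, 25, 30]

# Tally of how often each mark occurs (a simple frequency distribution).
tally = Counter(marks)

mean = statistics.mean(marks)
median = statistics.median(marks)
q1, q2, q3 = statistics.quantiles(marks, n=4)  # lower quartile, median, upper quartile
spread = statistics.stdev(marks)               # sample standard deviation

print("Tally:", dict(tally))
print("Mean:", mean)
print("Median:", median)
print("Quartiles:", q1, q2, q3)
print("Standard deviation:", round(spread, 2))
```

The minimum, lower quartile, median, upper quartile and maximum are the five values needed to draw a box and whisker diagram.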

Modern graphical calculators can do all of the above tasks.

These graphs and statistics are studied in more detail in other topics in this course.

### Displaying and Reporting the Investigation

Finally, and most importantly, the data, tables, graphs and statistics can be analysed and presented in a report. Pictographs, column graphs and pie graphs are common ways of displaying the data, along with those mentioned above.

Along with the data and analysis, there should be an introduction listing the objectives of the investigation, a summary of the findings, and conclusions, possibly with recommendations.