Data analysis in environmental health
The purpose of this tutorial is to focus on the principles and processes of analysis of environmental health data. We shall discuss principles of data analysis in general, and environmental health data in particular. The data analysis steps will be focused on analysing categorical variables (see below).
Different types of Data
Constants and Variables
- Constants. – Some entities have fixed values. These numbers do not change or are fixed by convention. Examples: pi (approximately 3.1416, often approximated as 22/7), e (approximately 2.718), g (approximately 9.8 m/s^2 at the Earth's surface) and others. Everything else that changes value is termed a variable.
For most entities, measured values change over time, across space, and under different circumstances. These measurements are referred to as variables. For example, the quality of ambient air can be measured using the concentration of particulate matter with a diameter of 10 micrometres or less. These particles are known as PM10. As PM10 concentrations vary over time, PM10 is a variable, and so is air quality. Every variable has “levels”; each level contains values in the form of numbers or texts, or numbers as texts.
Categorical. – In categorical variables, each level is distinct from the others.
* Nominal. – These have individual levels that are “names”. Each level is distinct and the levels have no inherent order. Examples include a “name” variable, identity codes (ID), gender (“male”, “female”), and level of contamination of a site (“contaminated”, “not contaminated”).
* Ordinal. – The levels are distinct from each other and are rank ordered (“good”, “better”, “best”). Examples include level of contamination of a site (“not contaminated”, “some contamination”, “heavily contaminated”).
Continuous. – Here, individual levels as numbers are on a continuum.
- Interval. – This is a continuous linear sequence of numbers. For interval variables, no concept of an absolute zero exists. Absolute zero means that a value of 0 assigned to a measurement indicates that the entity does not exist. An example of an interval variable is temperature measured in degrees centigrade. When temperature is measured in centigrade or Fahrenheit, a value of zero degrees indicates a certain amount of heat, not the absence of heat. A temperature of 5 degrees centigrade is 4 degrees more than 1 degree centigrade. It does not mean that an object at 5 degrees centigrade is five times as hot as an object at 1 degree centigrade. There is no notion of an absolute zero reference point when temperature is measured on the centigrade scale.
- Ratio. – Ratio measures are used where an absolute zero exists. To continue with the above example, zero on the absolute or Kelvin scale indicates absence of heat (this is a theoretical state). Here, a temperature of 5 kelvin is five times a temperature of 1 kelvin. An example in health care is prevalence of a disease. Zero prevalence indicates absence of the disease in the population. For two diseases, if one has a prevalence of 100 per 1000 and another has 50 per 1000 population, then the first disease is twice as prevalent as the second.
Concept of Summary Measures.
– We use summary measures to describe continuous variables, and frequency distributions to describe categorical variables. Central tendency and dispersion are the two kinds of summary measure.
Concepts of Central Tendency and Dispersion
For continuous variables, the summary measures of central tendency are the mean (the arithmetic average), the median (the middle value when the data are ranked), and the mode (the value that occurs most often in the distribution). The summary measures of dispersion include the variance, the standard deviation (the square root of the variance), and the standard error of the mean (the standard deviation divided by the square root of the number of individuals in the sample).
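As a minimal sketch of these summary measures in R, the example below uses a handful of invented PM10 readings (the numbers and the variable names are assumptions made for illustration, not data from this tutorial). Base R has no built-in function for the statistical mode, so the sketch takes the most frequent value from a frequency table.

```r
# Hypothetical PM10 readings (micrograms per cubic metre), invented for illustration
pm10 <- c(12, 15, 15, 18, 20, 22, 25, 30, 41)

# Measures of central tendency
centre_mean   <- mean(pm10)
centre_median <- median(pm10)
freq          <- table(pm10)                               # frequency of each value
centre_mode   <- as.numeric(names(freq)[which.max(freq)])  # most frequent value

# Measures of dispersion
spread_var <- var(pm10)                        # sample variance (divides by n - 1)
spread_sd  <- sd(pm10)                         # square root of the variance
spread_se  <- spread_sd / sqrt(length(pm10))   # standard error of the mean

print(c(mean = centre_mean, median = centre_median, mode = centre_mode))
print(c(variance = spread_var, sd = spread_sd, se = spread_se))
```

If several values tie for the highest frequency, `which.max` returns only the first of them, so this shortcut is best treated as a quick check rather than a general-purpose mode function.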
For details of online calculations, see
A distribution describes the probability of finding specific values of a variable under an assumed pattern. In environmental health, we commonly use three distribution patterns.
- Normal or Gaussian Distribution. – This is relevant for interval or ratio measures. In a normal distribution, about 68% of the values lie within one standard deviation of the mean. About 95% of the values lie within 1.96 standard deviations of the mean. Say systolic blood pressure follows a normal distribution with a mean of 110 mmHg and a standard deviation of 30 mmHg, based on measurements of 100 individuals. The statistical notation used here is Blood Pressure ~ N(110, 30), where the second number is the standard deviation (many texts write the variance in this position instead). From this we know that (1) the mean blood pressure is 110 mmHg, (2) about 68% of the individuals will have a systolic blood pressure between 80 and 140 mmHg, and (3) about 95% of individuals will have a systolic blood pressure between 51 and 169 mmHg. The standard error of the mean would be 30/sqrt(100) = 3. Refer to the following webpage to learn more about normal distribution.
- Binomial Distribution. – The binomial distribution applies when a variable assumes one of two values (True and False, Yes and No). Take the example of hypertension. If we define hypertension on the basis of a certain cut-off value for systolic blood pressure, then hypertension becomes a binary variable. Prevalence of hypertension will be a proportion. For a single yes/no observation, the mean is the proportion (p) and the standard deviation is sqrt(p * (1 – p)); the standard error of a proportion estimated from n individuals is sqrt(p * (1 – p) / n). In this sense, prevalence of a disease follows a binomial distribution. For more information on binary distribution, see
- Poisson Distribution. – The Poisson distribution applies to counts of events. For example, incidence density rates are in the form of number of events over a certain number of person-years. Note that one person followed up for one year forms one person-year. Incidence density rates follow the Poisson distribution. The rate is expressed in the form x/N, where x = number of events and N = person-time at risk in the base population. In a Poisson distribution, the mean and the variance are the same, so the standard deviation is the square root of the mean. For more information, see
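The figures quoted for the three distributions can be checked with a short R sketch. The blood pressure values come from the example above; the hypertension prevalence of 0.25 and the 12 events over 400 person-years are invented for illustration only.

```r
# Normal: systolic blood pressure with mean 110 mmHg and standard deviation 30 mmHg
bp_mean <- 110
bp_sd   <- 30
n       <- 100

# Proportion of values within one standard deviation of the mean (about 68%)
within_1sd <- pnorm(bp_mean + bp_sd, bp_mean, bp_sd) -
  pnorm(bp_mean - bp_sd, bp_mean, bp_sd)
# Range that contains the central 95% of values
range_95 <- qnorm(c(0.025, 0.975), mean = bp_mean, sd = bp_sd)
# Standard error of the mean for 100 individuals
se_mean <- bp_sd / sqrt(n)

# Binomial: hypothetical prevalence of hypertension (assumed value)
p       <- 0.25
sd_one  <- sqrt(p * (1 - p))       # standard deviation of a single yes/no observation
se_prop <- sqrt(p * (1 - p) / n)   # standard error of the estimated proportion

# Poisson: hypothetical 12 events over 400 person-years (assumed values)
events       <- 12
person_years <- 400
rate         <- events / person_years   # incidence density rate
sd_count     <- sqrt(events)            # for a Poisson count, variance equals the mean

print(round(c(within_1sd = within_1sd, lower = range_95[1], upper = range_95[2]), 2))
print(c(se_mean = se_mean, se_prop = se_prop, rate = rate, sd_count = sd_count))
```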
Data preprocessing is the first step of data analysis. Specifically, it refers to the following steps:
- Read the data set into a statistical software package for processing.
- Test for the presence of missing values, extreme values (outliers), and other inconsistencies.
- Address these issues.
Use the following steps to preprocess data:
- Tabulate frequency distributions. – Examine one variable at a time; tabulate and visualise it.
- Find outliers. – Outliers are extreme values. A common rule treats a value as an outlier if it lies more than 1.5 times the interquartile range below the 25th percentile or above the 75th percentile. The interquartile range is the distance between the values at the 25th and 75th percentiles of the ranked data.
- Examine box plots. – Box plots are graphical summaries of continuous variables. Here, the boundaries of the box are the values at the 25th and 75th percentiles. Two whiskers project from the ends of the box to the most extreme values that lie within 1.5 times the interquartile range of the box. Any value that lies beyond the whiskers is an outlier.
Identify missing values and outliers. Where possible, find out why some values are outliers and some values are missing. Wherever possible, correct them. You may need to refer to the original data sources and correct them by hand.
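The preprocessing checks above can be sketched in R as follows. The measurements are invented for illustration, with one missing value and one extreme value planted on purpose.

```r
# Hypothetical measurements with one missing value (NA) and one extreme value
x <- c(4, 5, 5, 6, 7, 7, 8, 9, 10, 48, NA)

# Frequency distribution, including missing values
print(table(x, useNA = "ifany"))
n_missing <- sum(is.na(x))

# Interquartile range and the 1.5 x IQR fences
q     <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
iqr   <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Values that fall outside the fences
outliers <- x[!is.na(x) & (x < lower | x > upper)]
print(outliers)

# boxplot(x) draws the matching box plot; boxplot.stats(x)$out lists the points
# beyond the whiskers (it uses hinges, which can differ slightly from quantile())
```

Flagging a value this way only marks it for inspection. Whether it is corrected, kept, or excluded still depends on checking the original data source.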
Theories guide our research questions. This happens when we aim to find an association between two variables. For example, we have a theory that climate change causes respiratory diseases. We start with hypotheses based on our theories. Hypotheses come in pairs: the null hypothesis (no difference or no association) and the alternative hypothesis (a difference or an association exists). For example, based on the theory that climate change causes respiratory diseases, we can frame the following hypotheses:
- The null hypothesis would state that the prevalence of respiratory disease is the same for countries with and without high per capita carbon dioxide emission (a marker for climate change).
- The alternative hypothesis would state that for countries with high carbon dioxide emission, respiratory disease prevalence would be higher.
We then collect data to test if the null hypothesis is false. If our data suggest that the null hypothesis is false, we reject the null. However, in doing that we are open to two types of errors:
- We may falsely reject the null hypothesis (alpha error, also called type I error)
- We may falsely fail to reject the null hypothesis (beta error, also called type II error)
Read more about hypothesis testing here. In planning our studies, we set the probability limits of these two errors beforehand. Usually, we set the probability of alpha error at five percent, and set the probability of beta errors at 20%.
Alpha and beta errors
| Reject or Not | Null Hypothesis True | Null Hypothesis False |
|---|---|---|
| Reject Null | Falsely reject null (alpha error) | Correctly reject null |
| Fail to Reject Null | Correctly fail to reject null | Falsely fail to reject null (beta error) |
We fix the alpha and beta errors when the study commences. These in turn help to estimate the sample size for the study. The power of a study is (1 – beta error). For sample size estimation, we need to fix the values of the alpha error, the beta error (or power), and the desired effect estimate. Sample sizes are established on that basis. For more information on sample size estimation, read
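As a rough sketch of how these three inputs lead to a sample size, the example below uses `power.prop.test()` from R's built-in stats package. The two prevalence values (10% and 20%) are assumptions chosen only for illustration; alpha is set at 5% and power at 80% as described above.

```r
# Assumed effect for illustration: prevalence of 10% in one group, 20% in the other
ss <- power.prop.test(p1 = 0.10, p2 = 0.20,   # proportions to be compared
                      sig.level = 0.05,       # alpha error
                      power = 0.80)           # 1 - beta error

# Required number of individuals in EACH group, rounded up
n_per_group <- ceiling(ss$n)
print(ss)
print(n_per_group)
```

Leaving out `power` and supplying `n` instead makes the same function return the power achieved by a given sample size, which is useful when the number of available countries or participants is fixed in advance.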
P-values and 95 % confidence intervals
The p-value in a study is the probability of observing results at least as extreme as those obtained if the null hypothesis were true. If this value is very low (say 5% or less), results like ours would be unlikely under the null hypothesis, and we reject the null. Correspondingly, a high p-value indicates that the findings are compatible with the null hypothesis; it does not prove that the null hypothesis is true. Note that the p-value reflects both sample size and effect size. If the sample size is very large, even small effect sizes can return low p-values. Likewise, with small samples, only large effect sizes can yield low p-values. Therefore, a single p-value cannot separate these two pieces of information. A better approach is to report a 95% confidence interval. This provides the point estimate together with a range of values compatible with the data; if the study were repeated many times, about 95% of such intervals would contain the true value.
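To see a p-value and a 95% confidence interval side by side, here is a small sketch using `t.test()` on simulated data. The two groups, their means, and their standard deviations are invented; nothing here comes from the WHO or World Bank data sets.

```r
set.seed(1)  # makes the simulated numbers reproducible

# Simulated measurements for two hypothetical groups of 30 (assumed means and sd)
low  <- rnorm(30, mean = 100, sd = 20)
high <- rnorm(30, mean = 115, sd = 20)

# Welch two-sample t-test of the difference in means
res <- t.test(high, low)

p_value    <- res$p.value              # probability of a result this extreme under the null
ci         <- res$conf.int             # 95% confidence interval for the difference in means
diff_means <- mean(high) - mean(low)   # point estimate of the difference

print(p_value)
print(ci)
print(diff_means)
```

The confidence interval reports the size of the difference and its precision in the units of the measurement, which the p-value alone does not.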
Let us put these ideas to a real world data analysis. We shall use the R software to conduct an actual data analysis.
Use of R to illustrate how to analyse data
We shall use a data set on environmental health. In this tutorial, we are going to examine the theory that carbon emissions are linked with tuberculosis. From this theory, we derive the hypothesis that high per capita carbon emission will be associated with high prevalence of tuberculosis. The null hypothesis will be that countries with high and low per capita carbon emission have similar levels of tuberculosis, because carbon emission and tuberculosis are not related to each other.
To test the hypotheses, we shall download data from the World Health Organisation about TB prevalence in different countries, and we shall obtain data on carbon emissions in kilogram per year per person (per capita carbon dioxide emission) from the World Bank databases. Then we shall link the two data sets. After linking the two data sets, we shall examine them graphically and run preliminary analyses.
We shall also use R for these steps. Please follow along using the steps with your own installation of R software. The R code is provided at the end of this lesson.
Introduction to R
R is a statistical data analysis and programming environment. R is quite powerful because it is also a programming language: you can develop new functions and write new programmes in it. For more information, see the R web site
Obtain and install R
- To get started, download R software package from the following webpage
Select your specific operating system and get started. Once you install R, open the R console and type help.start() to get started. Read the user manual as best as you can. If you have questions, you can use the help in different ways:
- You can search the help pages by typing ??"the question itself"; for example, ??"linear model"
- If you know what topic you are interested in learning more about, type help("the topic"); for example, if you want to learn more about the linear model function (lm), type help("lm")
- You can ‘google’ about the specific problem you are searching about
- You can join the R help discussion group and post your question there.
Concept of objects in R
R works on the principles of objects and functions. Objects and functions are the structural and functional building blocks of R. Functions and data together make up packages that R uses for data analysis. Objects are where R stores data. Four common types of objects are vectors (a single sequence of data of one type), matrices (two-dimensional arrays of data of one type), lists (collections of items that may be of different types), and data frames (tables whose columns are vectors of equal length, possibly of different types). To find out which objects exist within an R environment, type ls() and R will return the names of the objects. Every object has its own properties, and you can manipulate objects in different ways.
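A brief sketch of the four object types follows. All names and values are invented for illustration.

```r
# Vector: a single sequence of values of one type
v <- c(10.2, 15.8, 7.4)

# Matrix: a two-dimensional arrangement of values of one type
m <- matrix(1:6, nrow = 2, ncol = 3)

# List: a collection of items that may be of different types
l <- list(site = "A", readings = v)

# Data frame: a table whose columns can hold different types
df <- data.frame(country = c("X", "Y", "Z"), emission = v)

print(ls())        # names of the objects in the current environment
print(class(df))   # the type of an object
str(df)            # a compact description of its structure
```

Data frames are the object type used most often in the rest of this tutorial, because `read.csv()` and `merge()` both return them.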
The steps of data analysis
- Visit the WHO Global Health Observatory data bank
- Select a theme; here we select “tuberculosis” (see the screenshot)
- Then we select the link on “download more tb data” (see the screenshot)
- We download and store the data set in a folder (this is the folder where we shall keep, and later release, all our analyses)
- Next, visit the World Bank web page
- From the World Bank web page on data sets, we select data on carbon dioxide emissions (see the screenshot)
Here is the URL
(You will need to create an account to download the data sets)
- Download the data sets to the same folder where you will be working in R as well
- Now you should see the data sets, in comma-separated value (CSV) format, in the same folder
The Cleaning process
See the screenshots of the data sets
We merge the two data sets
- First, create a script in R and save it as practice.R in the folder
- Next, read the csv file into R
- Next, merge the two data sets to create a composite data set
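Before running the merge on the real files, it may help to see what `merge()` does on two tiny data frames. The country labels and numbers below are invented; only the idea of joining on a shared `country` column mirrors the steps above.

```r
# Two hypothetical data sets that share a "country" column
emission <- data.frame(country  = c("A", "B", "C"),
                       emission = c(1.2, 6.5, 14.0),
                       stringsAsFactors = FALSE)
tbdata   <- data.frame(country    = c("A", "B", "D"),
                       prevalence = c(300, 80, 150),
                       stringsAsFactors = FALSE)

# By default merge() keeps only the countries present in BOTH data sets
mydata <- merge(emission, tbdata, by = "country")
print(mydata)

# Countries that were dropped because they appear in only one data set
unmatched <- setdiff(union(emission$country, tbdata$country), mydata$country)
print(unmatched)
```

Country names are often spelled differently by different agencies, so it is worth checking the unmatched list after a real merge. Passing `all = TRUE` to `merge()` keeps unmatched rows and fills the gaps with `NA`.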
As for data preprocessing (also known as data cleaning or data munging),
- We first plot the two variables
- Next, we see that those countries that have high TB prevalence also have low emissions, and those countries that have high emission per capita have low TB prevalence. Let us further explore these associations. The next step in the exploratory data analysis is to construct bar plots. We divide the per capita carbon emission into three groups – less than 5 kilograms, between 5-9 kilograms, and 10 kilograms or above. Let’s see the average prevalence of tuberculosis in the three groups of countries with different amounts of per capita emissions.
- We do a one way analysis of variance or ANOVA to explore this further.
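The grouping, group means, and one-way ANOVA described above can be sketched on a toy data set as follows. All twelve rows are invented for illustration, and the variable names simply echo the ones used in this tutorial.

```r
# Hypothetical per capita emission and TB prevalence for twelve countries
toy <- data.frame(
  emission   = c(0.5, 1.8, 3.2, 4.9, 5.0, 6.7, 8.1, 9.9, 10.0, 12.4, 16.3, 20.5),
  prevalence = c(410, 350, 290, 330, 160, 120, 140, 100, 40, 25, 30, 15)
)

# Three groups: less than 5, 5 up to (not including) 10, and 10 or above
toy$em3 <- cut(toy$emission, breaks = c(0, 5, 10, Inf), right = FALSE,
               labels = c("lt 5", "5 to 9", "10 or more"))

group_n     <- table(toy$em3)                           # countries per group
group_means <- tapply(toy$prevalence, toy$em3, mean)    # mean prevalence per group

# One-way analysis of variance of prevalence across the three groups
fit <- aov(prevalence ~ em3, data = toy)

print(group_n)
print(group_means)
print(summary(fit))
```

Setting `right = FALSE` makes each interval include its lower boundary, so a country with an emission of exactly 5 falls in the middle group and one with exactly 10 falls in the top group.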
This way you can explore relationships of different variables. You can obtain environmental data and disease or health related data from World Bank and WHO. You can link data across web sites. You can also run simple analyses on environmental health.
Can you think of a few reasons why tuberculosis rates are high for countries with low per capita emission rates? How will you test your theory and the resulting hypotheses?
Run these analyses and modify them yourself using the R code posted here:
# This file hosts the code for the practice session
#
# First, set the working directory
# Set your own directory (do not use mine)
setwd("/Users/arindambose/Dropbox/2015-03-30-teaching/hlth214/dataanalysis/practice")

# Next, read the csv files
# (".." marks missing values in the emission file)
emission <- read.csv("carbonemission.csv", header = TRUE, sep = ",", na.strings = "..")
tbdata <- read.csv("tbbod3.csv", header = TRUE, sep = ",")
ozone <- read.csv("ozone.csv", header = TRUE, sep = ",", na.strings = "NA")

# Change the names of the variables in emission to "country" and "emission"
names(emission) <- c("country", "emission")
# Find out the names of the variables
print(c(names(emission), names(tbdata)))

# Now, merge the two data sets on the shared "country" column
mydata <- merge(emission, tbdata, by = "country")
# Change the names of the variables (assumes the merged data has these seven columns)
names(mydata) <- c("country", "emission", "iso3", "year", "population", "prevalence", "incidence")
print(names(mydata))

# Histogram of prevalence of tb
hist(mydata$prevalence)
# Plot per capita carbon emission against prevalence of tuberculosis
plot(mydata$emission, mydata$prevalence,
     main = "Carbon Emission versus Prevalence of tuberculosis",
     xlab = "Percapita carbon emission", ylab = "Prevalence of tb")
# This plot shows that countries with low per capita carbon emission
# are also countries that have high prevalence of tb

# Divide the per capita carbon emission into three groups (low, medium, high):
# less than 5, 5 to 9, and 10 or above, then see how tb prevalence differs
mydata$em3 <- cut(mydata$emission, breaks = c(0, 5, 10, Inf), right = FALSE)
levels(mydata$em3) <- c("lt 5", "5 to 9", "10 or more")
# With a factor on the x axis, plot() draws box plots of prevalence by group
plot(mydata$em3, mydata$prevalence,
     main = "Carbon Emission and Prevalence of tuberculosis",
     xlab = "Levels of carbon emission", ylab = "Prevalence of tuberculosis")
# barplot(mydata$prevalence, mydata$em3)

# Mean prevalence in each emission group, ignoring missing values
prevbym3 <- tapply(mydata$prevalence, mydata$em3, mean, na.rm = TRUE)
barplot(prevbym3, main = "Prevalence of tb by emission levels",
        xlab = "Levels of emission", ylab = "Prevalence of tb")
print(table(mydata$em3))

# One-way analysis of variance, followed by unadjusted pairwise comparisons
prevm3 <- aov(prevalence ~ em3, data = mydata)
print(summary(prevm3))
print(pairwise.t.test(mydata$prevalence, mydata$em3, p.adjust.method = "none"))