Data Analysis in Environmental Health

The purpose of this tutorial is to focus on the principles and processes of analysis of environmental health data. We shall discuss principles of data analysis in general, and environmental health data in particular. The data analysis steps will be focused on analysing categorical variables (see below).

Different types of Data

Constants and Variables

  • Constants. – Some entities have fixed values. These numbers do not change, or their values are accepted by convention. Examples: pi (approximately 22/7, or 3.1416), e (approximately 2.718), g (approximately 9.8 m/s^2), and others. Anything whose value changes is termed a variable.


For most entities, measured values change over time, across space, and under different circumstances. Such measurements are referred to as variables. For example, the quality of ambient air can be measured by the concentration of particulate matter suspended in the air. Particles with a diameter of 10 micrometres or less are known as PM10. As PM10 concentrations vary over time, PM10 is a variable, and so is air quality. Every variable has “levels”; each level contains values in the form of numbers, text, or numbers stored as text.


Categorical. – In categorical variables, each level is distinct from the others.
* Nominal. – The individual levels are “names”. Each level is distinct and no level ranks above another. Examples include a “name” variable, identity codes (ID), gender (“male”, “female”), and level of contamination of a site (“contaminated”, “not contaminated”).
* Ordinal. – The levels are distinct from each other and are rank ordered (“good”, “better”, “best”). Examples include level of contamination of a site (“not contaminated”, “some contamination”, “heavily contaminated”).

Continuous. – Here, individual levels as numbers are on a continuum.

  • Interval. – This is a continuous linear sequence of numbers with no absolute zero. Absolute zero means that a value of 0 indicates that the measured entity does not exist. An example of an interval variable is temperature measured in degrees centigrade. When temperature is measured in centigrade or Fahrenheit, a value of zero degrees indicates a certain amount of heat, not the absence of heat. A temperature of 5 degrees centigrade is 4 degrees more than 1 degree centigrade. It does not mean that an object at 5 degrees centigrade holds 5 times the heat of an object at 1 degree centigrade, because the centigrade scale has no absolute zero reference point.
  • Ratio. – Ratio measures are used where an absolute zero exists. To continue with the above example, zero on the absolute or Kelvin scale indicates absence of heat (this is a theoretical state). Here, an object at 5 Kelvin has 5 times the absolute temperature of an object at 1 Kelvin. An example in health care is prevalence of a disease. Zero prevalence indicates absence of the disease in the population. If one disease has a prevalence of 100 per 1000 population and another has 50 per 1000, then the first disease is twice as prevalent as the second.
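The variable types above map onto R data types. The following is a minimal sketch with made-up values (the object names are illustrative only): nominal variables become factors, ordinal variables become ordered factors, and the temperature example shows why ratios only make sense on a scale with an absolute zero.

```r
# Nominal variable: levels are names with no inherent order
site_status <- factor(c("contaminated", "not contaminated", "contaminated"))
freq <- table(site_status)  # frequency distribution of the levels

# Ordinal variable: levels are distinct and rank ordered
contamination <- factor(
  c("some contamination", "not contaminated", "heavily contaminated"),
  levels = c("not contaminated", "some contamination", "heavily contaminated"),
  ordered = TRUE
)
# Order comparisons are only meaningful for ordered factors
is_worse <- contamination[3] > contamination[1]

# Interval versus ratio: 1 and 5 degrees centigrade converted to Kelvin
celsius <- c(1, 5)
kelvin <- celsius + 273.15
ratio_kelvin <- kelvin[2] / kelvin[1]  # close to 1, not 5

print(freq)
print(is_worse)
print(ratio_kelvin)
```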

Concept of Summary Measures.

We use summary measures to describe continuous variables and frequency distributions to describe categorical variables. Summary measures fall into two groups: measures of central tendency and measures of dispersion.

Concepts of Central Tendency and Dispersion

For continuous variables, the summary measures of central tendency are the mean (the arithmetic average), the median (the middle value when the values are rank ordered), and the mode (the value that occurs most often in the distribution). The summary measures of dispersion include the variance, the standard deviation (the square root of the variance), and the standard error of the mean (the standard deviation divided by the square root of the number of individuals in the sample).

For details of online calculations, see
Central Tendency
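As an illustration, these summary measures can be computed in R. This is a small sketch on hypothetical PM10 readings (the numbers are invented for the example). Base R has no built-in function for the statistical mode, so the sketch takes the most frequent value from a frequency table.

```r
# Hypothetical PM10 readings (made-up values for illustration)
pm10 <- c(12, 15, 15, 18, 22, 25, 31)

# Central tendency
pm10_mean   <- mean(pm10)
pm10_median <- median(pm10)
pm10_mode   <- as.numeric(names(which.max(table(pm10))))  # most frequent value

# Dispersion
pm10_var <- var(pm10)                       # sample variance
pm10_sd  <- sd(pm10)                        # square root of the variance
pm10_se  <- pm10_sd / sqrt(length(pm10))    # standard error of the mean

print(c(mean = pm10_mean, median = pm10_median, mode = pm10_mode,
        var = pm10_var, sd = pm10_sd, se = pm10_se))
```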


A distribution describes the probability of finding specific values of a variable under an assumed pattern. In environmental health, we commonly use three distribution patterns.

  • Normal or Gaussian Distribution. – This is relevant for interval or ratio measures. In a normal distribution, about 68% of the values lie within one standard deviation of the mean, and about 95% of the values lie within 1.96 standard deviations of the mean. Say systolic blood pressure follows a normal distribution with a mean of 110 mmHg and a standard deviation of 30 mmHg, based on measurements of 100 individuals. The statistical notation is Blood Pressure ~ N(110, 30^2), where the second term is the variance. From this we know that (1) the mean blood pressure is 110 mmHg, (2) about 68% of the individuals will have a systolic blood pressure between 80 and 140 mmHg, and (3) about 95% of individuals will have a systolic blood pressure between 51 and 169 mmHg. The standard error of the mean would be 30/sqrt(100) = 3. Refer to the following webpage to learn more about normal distribution.
  • Binomial Distribution. – The binomial distribution applies when a variable takes one of two values (True and False, Yes and No). Take the example of hypertension. If we define hypertension on the basis of a cut-off value for systolic blood pressure, then hypertension becomes a binary variable, and the prevalence of hypertension is a proportion. For a single individual, the mean of this distribution is the proportion (p) and the standard deviation is sqrt(p * (1 – p)); for a sample of n individuals, the standard error of the proportion is sqrt(p * (1 – p) / n). In this sense, prevalence of a disease follows a binomial distribution. For more information on binomial distribution, see
  • Poisson Distribution. – The Poisson distribution applies to counts of events. For example, incidence density rates are expressed as the number of events per unit of person-time. Note that one person followed up for one year contributes one person-year. The number of events counted over a given amount of person-time follows a Poisson distribution, and the rate is expressed as x/N, where x = number of events and N = base population (here, person-years of follow-up). In a Poisson distribution, the mean and the variance are the same, so the standard deviation is the square root of the mean. For more information, see
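The figures quoted for the three distributions can be checked in R. The sketch below uses the blood pressure example from the text (mean 110, standard deviation 30, n = 100); the prevalence of 0.25 and the counts of 12 events over 400 person-years are hypothetical values chosen only for illustration.

```r
# Normal distribution: blood pressure example from the text
bp_mean <- 110
bp_sd   <- 30
n       <- 100

# Share of values within one standard deviation of the mean (about 68%)
within_1sd <- pnorm(bp_mean + bp_sd, mean = bp_mean, sd = bp_sd) -
  pnorm(bp_mean - bp_sd, mean = bp_mean, sd = bp_sd)

# Range that contains the central 95% of values (about 51 to 169 mmHg)
range_95 <- qnorm(c(0.025, 0.975), mean = bp_mean, sd = bp_sd)

# Standard error of the mean
se_mean <- bp_sd / sqrt(n)

# Binomial distribution: hypothetical prevalence of hypertension
p         <- 0.25
sd_single <- sqrt(p * (1 - p))       # standard deviation for one individual
se_prop   <- sqrt(p * (1 - p) / n)   # standard error of the proportion

# Poisson distribution: hypothetical events and person-years
events       <- 12
person_years <- 400
rate         <- events / person_years  # incidence density rate
sd_count     <- sqrt(events)           # variance of a Poisson count equals its mean

print(within_1sd)
print(range_95)
print(c(se_mean = se_mean, se_prop = se_prop, rate = rate, sd_count = sd_count))
```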

Data Preprocessing

Data preprocessing is the first step of data analysis. Specifically, it refers to the following steps:

  • Read a data set into statistical software for processing.
  • Test for the presence of missing values, extreme values (outliers), and other inconsistencies.
  • Address these issues.

Use the following steps to preprocess data:

  • Tabulate frequency distributions. – Tabulate and visualise one variable at a time.
  • Find outliers. – Outliers are extreme values. A common rule treats a value as an outlier when it lies more than 1.5 times the interquartile range below the 25th percentile or above the 75th percentile. The interquartile range is the distance between the values at the 25th and 75th percentiles in rank order.
  • Examine box plots. – Box plots are graphical summaries of continuous variables. The edges of the box mark the 25th and 75th percentiles. One whisker projects from each end of the box, extending up to 1.5 times the interquartile range. Any value that lies beyond the whiskers is an outlier.
  • Identify missing values and outliers. – Where possible, find out why some values are outliers and why some values are missing, and correct them. You may need to refer to the original data sources and correct them by hand.
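The preprocessing checks above can be sketched in R as follows. The vector x holds hypothetical measurements with one missing value and one extreme value; the 1.5 times interquartile range rule is applied by hand and then compared with what R's box plot machinery reports.

```r
# Hypothetical measurements with one missing value (NA) and one extreme value
x <- c(8, 9, 10, 11, 12, 13, 14, 60, NA)

# Count missing values
n_missing <- sum(is.na(x))

# Interquartile range and the 1.5 x IQR fences
q     <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
iqr   <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Values beyond the fences are flagged as outliers
outliers <- x[!is.na(x) & (x < lower | x > upper)]

# boxplot.stats() applies a similar rule (based on hinges) without drawing a plot;
# boxplot(x) would draw the box plot itself
outliers_boxplot <- boxplot.stats(x)$out

print(n_missing)
print(outliers)
print(outliers_boxplot)
```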

Hypothesis Testing

Theories guide our research questions, particularly when we aim to find an association between two variables. For example, we may have a theory that climate change causes respiratory diseases. We start with hypotheses based on our theories. Hypotheses come in pairs: the null hypothesis (no difference or no association) and the alternative hypothesis (a difference or an association exists). For example, based on the theory that climate change causes respiratory diseases, we can frame the following hypotheses:

  • The null hypothesis would state that the prevalence of respiratory disease is the same for countries with and without high per capita carbon dioxide emission (a marker for climate change).
  • The alternative hypothesis would state that respiratory disease prevalence is higher in countries with high per capita carbon dioxide emission.

We then collect data to test if the null hypothesis is false. If our data suggest that the null hypothesis is false, we reject the null. However, in doing that we are open to two types of errors:

  • We may falsely reject the null hypothesis (alpha error)
  • We may falsely fail to reject the null hypothesis (beta error)

Read more about hypothesis testing here. In planning our studies, we set the probability limits of these two errors beforehand. Usually, we set the probability of alpha error at five percent, and set the probability of beta errors at 20%.

Alpha and beta errors

Reject or Not | Null Hypothesis True | Null Hypothesis False
Reject Null | Falsely reject null (alpha error) | Correctly reject null
Fail to Reject Null | Correctly fail to reject null | Falsely fail to reject null (beta error)

We fix the alpha and beta errors when the study commences. These in turn help to estimate the sample size for the study. The power of a study is (1 – beta error). For sample size estimation, we need to fix the values of the alpha error, the beta error (or power), and the desired effect estimate; the sample size is established on that basis. For more information on sample size estimation, read
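As a rough sketch of how these three inputs determine the sample size, R's power.prop.test() function can be used for a comparison of two proportions. The prevalences of 10% and 5% are hypothetical planning values, not figures from this tutorial.

```r
# Hypothetical planning values (assumptions made for this example):
# expected prevalence of 10% in one group and 5% in the other
plan <- power.prop.test(p1 = 0.10, p2 = 0.05,
                        sig.level = 0.05,  # alpha error
                        power = 0.80)      # 1 - beta error

# Number of participants needed in each group, rounded up
n_per_group <- ceiling(plan$n)

print(plan)
print(n_per_group)
```

Raising the power or shrinking the expected difference between the groups increases the required sample size.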

P-values and 95 % confidence intervals

The p-value is the probability of obtaining results at least as extreme as those observed if the null hypothesis were true. If this value is very low (say 5% or less), the observed results would be unlikely under the null hypothesis, and we take this as evidence against the null. Correspondingly, a high p-value indicates that the findings are compatible with the null hypothesis; it does not prove that the null hypothesis is true. Note that the p-value is a composite of sample size and effect size. If the sample size is very large, even small effect sizes can return low p-values. Likewise, with small samples, only large effect sizes can yield low p-values. Therefore, a single p-value cannot separate these two pieces of information. A better approach is to report a 95% confidence interval. This provides the point estimate together with a range of values for the true effect that is compatible with the data.
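The dependence of the p-value on sample size can be seen in a small sketch. Both comparisons below use the same hypothetical proportions (30% versus 20%); only the number of people per group differs. The counts are invented for illustration.

```r
# Same effect size (30% versus 20%), small sample: 40 people per group
small <- prop.test(x = c(12, 8), n = c(40, 40))

# Same effect size, large sample: 4000 people per group
large <- prop.test(x = c(1200, 800), n = c(4000, 4000))

# The p-value changes with sample size although the proportions are identical
print(c(small = small$p.value, large = large$p.value))

# The 95% confidence intervals for the difference in proportions show both
# the size of the effect and the precision of the estimate
print(small$conf.int)
print(large$conf.int)

ci_width_small <- diff(small$conf.int)
ci_width_large <- diff(large$conf.int)
```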

Let us apply these ideas to a real world data analysis. We shall use the R software to conduct an actual data analysis.

Use of R to illustrate how to analyse data

We shall use a data set on environmental health. In this tutorial, we examine the theory that carbon emissions are linked with tuberculosis. From this theory, we derive the hypothesis that high per capita carbon emission is associated with high prevalence of tuberculosis. The null hypothesis is that countries with high and low per capita carbon emission have similar levels of tuberculosis, that is, carbon emission and tuberculosis are not related to each other.

To test the hypotheses, we shall download data on TB prevalence in different countries from the World Health Organisation, and obtain data on per capita carbon dioxide emission (in kilograms per person per year) from the World Bank databases. Then we shall link the two data sets. After linking the two data sets, we shall examine them graphically and run preliminary analyses.

We shall also use R for these steps. Please follow along using the steps with your own installation of R software. The R code is provided at the end of this lesson.

Introduction to R

R is a statistical data analysis and programming environment. It is quite powerful because it is also a programming language: you can write new functions and new programmes in R, and it is not hard to learn. For more information, see the R web site

Obtain and install R

  1. To get started, download R software package from the following webpage
    Select your specific operating system and get started. Once you install R, visit the R console and type help.start() to get started. Read the user manual as best as you can. If you have questions, you can use the help in different ways:

    • You can search the help pages by typing ?? followed by a keyword in quotes, for example ??"regression"
    • If you know which topic you want to learn more about, type help("the topic"); for example, if you want to learn more about the linear model (lm), type help("lm")
    • You can ‘google’ about the specific problem you are searching about
    • You can join the R help discussion group and post your question there.

Concept of objects in R

R works on the principles of objects and functions. Objects and functions are the structural and functional building blocks of R. Functions and data together make up packages that R uses for data analysis. Objects are where R stores data. Four common types of objects are vectors (a single sequence of data of one type), matrices (a two-dimensional arrangement of data of one type), lists (collections of elements that can be of different types), and data frames (tables in which each column is a variable and the columns can be of different types). To find out which objects exist in an R environment, type ls() and R will return their names. Every object has its own properties, and you can manipulate objects in different ways.
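The four object types can be created and inspected as follows. This is a minimal sketch; the object names and values are made up for illustration.

```r
# Vector: a single sequence of values of one type
v <- c(1.5, 2.0, 3.2)

# Matrix: a two-dimensional arrangement of values of one type
m <- matrix(1:6, nrow = 2)

# List: elements of different types strung together
l <- list(site = "A", readings = v)

# Data frame: a table whose columns can be of different types
df <- data.frame(country = c("X", "Y"), emission = c(1.2, 9.8))

# Inspect the properties of each object
print(class(v))
print(dim(m))
str(l)
print(names(df))

# List the objects in the current environment
print(ls())
```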

The steps of data analysis

  1. Visit the WHO Global Health Observatory data bank
  2. Select a theme; here we select “tuberculosis” (see the screenshot)
  3. Then select the link on “download more tb data” (see the screenshot)
  4. Download and store the data set in a folder (this is the folder where we shall keep all our analyses and from where we shall later release them)
  5. Next, visit the World Bank web page
  6. From the World Bank web page on data sets, select data on carbon dioxide emissions (see screenshot)
    Here is the URL
    (You will need to create an account to download the data sets)
  7. Download the data sets to the same folder where you will be working in R
  8. You should now see the data sets, in comma separated value (CSV) format, in the same folder
    (See screenshots)

The Cleaning process

See the screenshots of the data sets
World bank data
WHO data

We merge the two data sets

  1. First, create a script in R and save it as practice.R in the folder
  2. Next, read the csv file into R
  3. Next, merge the two data sets to create a composite data set
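Before merging the real files, it may help to see how merge() behaves on two toy data frames. The country codes and values below are invented; the point is that, by default, only countries present in both data sets are kept.

```r
# Toy data frames standing in for the two downloaded data sets (made-up values)
emission_toy <- data.frame(country  = c("A", "B", "C"),
                           emission = c(0.5, 6.1, 12.3))
tb_toy <- data.frame(country    = c("A", "B", "D"),
                     prevalence = c(310, 95, 40))

# Default merge keeps only countries found in both data sets (A and B)
merged_toy <- merge(emission_toy, tb_toy, by = "country")

# all = TRUE keeps every country and fills the gaps with NA
merged_all <- merge(emission_toy, tb_toy, by = "country", all = TRUE)

print(merged_toy)
print(merged_all)
```

Comparing the row counts before and after the merge is a quick way to spot countries whose names are spelled differently in the two sources.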

As for data preprocessing (also known as data cleaning or data munging),

  1. We first plot the two variables and see

Scatter plot of tb with emissions

  1. Next we see that countries with high TB prevalence tend to have low emissions, and countries with high per capita emissions tend to have low TB prevalence. Let us explore these associations further. The next step in the exploratory data analysis is to construct bar plots. We divide the per capita carbon emission into three groups: less than 5 kilograms, 5 to 9 kilograms, and 10 kilograms or above. Let’s see the average prevalence of tuberculosis in the three groups of countries with different amounts of per capita emissions.

Bar Plot of tb with emission levels

  1. We do a one way analysis of variance or ANOVA to explore this further.
    Results of ANOVA

This way you can explore relationships of different variables. You can obtain environmental data and disease or health related data from World Bank and WHO. You can link data across web sites. You can also run simple analyses on environmental health.

Next Steps

Can you think of a few reasons why tuberculosis rates are high for countries with low per capita emission rates? How will you test your theory and the resulting hypotheses?

Run these analyses, and modify them yourself, using the R code posted here:

# This file hosts the code for the practice session
# Steps: set the working directory, then read the csv files
# first, set the working directory, for example with setwd()
# Set your own directory (do not use mine)

# Next, read the csv files
emission <- read.csv("carbonemission.csv", header = TRUE, sep = ","
                     , na.strings = "..")
tbdata <- read.csv("tbbod3.csv", header = TRUE, sep = ",")
ozone <- read.csv("ozone.csv", header = TRUE, sep = ",", na.strings = "NA")

# find out the names of the variables

# change the names of the variables in emission to "country" and "emission"

names(emission) <- c("country", "emission")
print(c(names(emission), names(tbdata)))

## change the names of the variables

#now, merge the two data sets

mydata <- merge(emission, tbdata, by = "country")

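# rename the first six columns of the merged data
# (the original list of names is cut off here, so only the first six
# columns are renamed; adjust this to match the columns in your own data)
names(mydata)[1:6] <- c("country", "emission",
                        "iso3", "year",
                        "population", "prevalence")
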

# plot per capita carbon emission and prevalence of tuberculosis
# Histogram of prevalence of tb


plot(mydata$emission, mydata$prevalence,
     main = "Carbon Emission versus Prevalence of tuberculosis",
     xlab = "Percapita carbon emission",
     ylab = "Prevalence of tb")

# this plot shows that countries with low per capita carbon emission are also countries with high prevalence of tb
# divide the per capita carbon emission into three groups (less than 5, 5-9, and 10 or above)
# and then see how tb prevalence differs across those groups
# right = FALSE makes the groups [0, 5), [5, 10) and [10, Inf), matching the text
mydata$em3 <- cut(mydata$emission, breaks = c(0, 5, 10, Inf), right = FALSE)
levels(mydata$em3) <- c("lt 5", "5to9", "ge 10")

plot(mydata$em3, mydata$prevalence,
     main = "Carbon Emission and Prevalence of tuberculosis",
     xlab = "Levels of carbon emission",
     ylab = "Prevalence of tuberculosis")

# bar plot of the mean tb prevalence within each emission group
prevbym3 <- tapply(mydata$prevalence, mydata$em3, mean, na.rm = TRUE)
barplot(prevbym3,
        main = "Prevalence of tb by emission levels",
        xlab = "Levels of emission",
        ylab = "Prevalence of tb")


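# one way analysis of variance of tb prevalence across emission groups
prevm3 <- aov(prevalence ~ em3, data = mydata)
print(summary(prevm3))
# pairwise comparisons between emission groups, without p-value adjustment
print(pairwise.t.test(mydata$prevalence, mydata$em3, p.adjust.method = "none"))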


Designing a Study in Environmental Health

Dr Dey was examining his patient’s hands intently. On that day, Ram Das, an agricultural labourer, was the last referred patient in his clinic. He had been referred by the district medical officer with puzzling lesions on his hands (see Figure 1).

This patient came to see him, like many others directed by their doctors, from a village about sixty kilometres away from Calcutta. The palms had dark raised lesions and the doctor was intrigued. He diagnosed that the patient had been exposed to arsenic and manifested signs of arsenic toxicity, but he was puzzled.
“Where are you from?” he asked.
“From Nishipur, doctor shaheb,” Ram Das told the good doctor. Dr Dey knew about this village, where most people lived by agriculture. People like Ram Das were unlikely to work in copper smelters, an occupational source of arsenic, for there were none nearby. He might be drinking water contaminated with inorganic arsenic, but the doctor was not sure if or how that could be the source of arsenic. It could also be due to pesticides, but he was not sure. So he asked him,
“What do you do for a living? From where do you get your drinking water?” This should have been obvious to him from the notes his assistants wrote anyway, but he wanted to confirm.
“I have a small plot of land that I farm, sir; and we have a shallow tube well that the community dug, and everyone drinks water from the same well,” Ram Das said.
This was puzzling. Dr Dey guessed that water from this tube well might have something to do with the skin disease that Ram reported. But it might not be just one well. Ram was the fifth patient he had seen that week with these painful lesions, all from different areas of the part of South Gangetic Bengal where Ram came from, and all referred by doctors across the district. The doctors were puzzled too; they knew that the most likely diagnosis was that the patients had been exposed to arsenic from somewhere, but it was not entirely clear where. At which point, he remembered to ask,
“Who else do you know with these lesions? I mean, are there others with similar skin disease in your locality?”
Ram told him that he knew at least ten other people with a very similar kind of skin disease, and some of them had lost fingers and toes; he knew others with cancer.
Dr Dey ordered a test for arsenic in Ram’s urine sample and sent it to a laboratory in Calcutta. The results came back the following week. They confirmed what he had suspected all along: Ram had very high levels of arsenic in his urine. He and his fellow doctors were puzzled about where it came from and what the story might be. Dr Dey called his friends at the Geological Survey of India and planned a study. Perhaps the experts at the Geological Survey would know what was going on. The well water needed to be checked.

Professor Smith was tired that evening after his long research tour overseas; first in India, where he had led an investigation and the subsequent publication of a report on arsenic toxicity in West Bengal, and then in Argentina, investigating arsenic toxicity. The phone rang. It was a call from the Times, from their New York office. They were wondering if the good professor could spare time for an hour long interview on the arsenic toxicity problem in India, which had reached fearsome proportions; also, some people in Utah and California were scared that they might end up drinking arsenic laced water from their ground water sources.

What shall we consider while setting up an environmental health study?

A study design refers to the layout and planning of a study. Environmental epidemiology, in the context of environmental health, refers to the study of the distribution and environmental determinants of diseases, in particular where the source of the exposure or the health effects has an element of human activity. Therefore, a carefully conducted epidemiological study is important in exploring environmental health issues. Here, we review issues that determine the study designs, methods, and applications.

Who, What, When, Where, Why, How

Every health research question has “who”, “what”, “when”, “where”, “how”, and “why” elements. The “who” element refers to humans. Who are affected? What are their age, gender distribution, socioeconomic factors? Can a pattern be identified? “When” and “where” indicate time and spatial distribution of the diseases. “How” and “why” are questions about the mechanism of disease or disease causation.
In turn, these questions provide directions for health research. Some health studies are purely descriptive; others are analytical. For example, when the question is the extent of air pollution in the city of Christchurch, analyses of air pollutants collected over a range of monitoring stations would be sufficient 1. On the other hand, if the question is whether air pollution is associated with death among the elderly, then another type of study is warranted. For example, Sadiva and colleagues used a time series analysis in Sao Paulo, Brazil to study the link between air quality and deaths in elderly individuals, linking data on deaths from the different wards of the city with air quality data from the monitoring stations 2.

Case series

Epidemiological studies can be descriptive (case reports, case series) or analytical (some ecological studies, cross sectional surveys, case control studies, and cohort studies). Case series and cross sectional surveys are well suited to describing health conditions, and their results help scientists generate hypotheses.
Disease surveillance, for example, is based on case series. Continuous monitoring of air quality at a place is an example of exposure surveillance. In New Zealand, the Environmental Science and Research (ESR) routinely conducts environmental and disease surveillance for a range of diseases in the country and posts the results in the public domain. In the United States, the Centers for Disease Control and Prevention publishes the Morbidity and Mortality Weekly Report, which provides results of surveillance for diseases worldwide.
Case series enable health researchers to frame an answerable question or rival hypotheses, which can then be investigated using analytical study designs. Hypotheses are derived from theories that explain phenomena. For example, in our story, Dr Dey was perplexed by the skin lesions he saw, and he set up the hypotheses that his patients may have been drinking water that contained arsenic in high concentrations, or that they may have been exposed to high concentrations of arsenic from some other source, which was then metabolised in the body and would appear in urine. Accordingly, he ordered urine tests for the presence of arsenic. When the reports turned out to be positive, he was certain that chronic exposure to arsenic was responsible for the cases. However, it would still be necessary to conduct epidemiological studies to establish that this was indeed the case in the population.
In addition, case series in environmental epidemiological studies can also be used to test hypotheses. Some case series methods, for instance the case crossover design and the self-controlled case series design, can be used to model single cases that occur over a time period to find the relationship between environmental variables such as extreme temperature and deaths or hospital admissions, or, as Heather Whitaker wrote, between usage of certain strains of MMR (measles, mumps, rubella) vaccine and risk of aseptic meningitis 3 4

Ecological Study (Time series)

Ecological study designs are those where aggregated data are obtained for both exposure and outcomes, and these data are then analysed together to test the hypothesis that exposure and outcomes are related to each other. For example, in studies of air pollution and health effects, air quality data are routinely collected from different stations throughout a city; from the same city blocks, hospital data on certain health outcomes (such as total deaths, admissions due to heart diseases, or admissions due to asthma) are collected; and the two are then analysed together. For example, in Sadiva and Dockery’s study in Brazil, they obtained data on deaths from the Sao Paulo municipal authority and air quality data from 12 monitoring stations 5.
//Insert figure 1 about here
Sadiva and Dockery found in their study that for each 100 ug/m^3 increment in PM10 levels, the risk of death in the elderly went up by about 8%. Does this automatically mean that on a bad air day, the chance of death for an individual elderly person was 8% higher? The answer is “no”, because other factors may be common to both poor air quality (high air pollutant concentration) on a particular day and risk of death on the same day. To construct a fictitious example, say the city had a fireworks display on a Sunday (a holiday), when the outpatients department, which refers patients for admission, was closed. On Monday, the air quality of the city would be bad because of accumulated particulate matter from the fireworks display. Monday being a working day, the outpatients department would be open and would see a higher inflow of patients, and possibly higher admission and death rates than over the weekend. It would therefore be wrong to claim, from the general pattern that hospital admissions for heart disease rise after days of poor air quality, that the risk also rises for each individual, as if no other factor could explain the pattern. This error in judgment is referred to as the “ecological fallacy”: results based on aggregated data in an ecological study cannot be generalised to individuals.

Cross sectional survey

A cross sectional survey is designed to generate a snapshot of a health problem in a community. It is useful for estimating the prevalence of a health outcome and for some levels of hypothesis testing. For example, Professor D N Guha Mazumder et al (1998) conducted a large cross sectional survey of 7683 people in the North 24-Parganas district in West Bengal state of India to study the association between arsenic exposure and skin lesions 6. While cross sectional surveys are useful study designs, they are not the best designs to understand cause and effect linkages. The reasons are these:
1. Cross sectional surveys are open to recall bias from respondents.
2. In cause and effect assessments, causes should precede health effects. In cross sectional surveys, it is impossible to be sure whether the exposure actually preceded the health outcome or whether they arose at the same time, because information on exposure is collected at the same time as data on health outcomes.

Case Control Study

In a case control study, participants are sampled on the basis of whether they have the disease in question. Those who have the disease are referred to as cases, and those who do not have the disease are referred to as controls. Both cases and controls are then assessed for their past exposures. For example, Haque et al (1999) also reported a case control study in the same population where the cross sectional survey of arsenic toxicity was conducted. In that case control study, Haque et al (2003) studied 192 persons with skin lesions and 213 individuals without skin lesions, followed up these people, sampled their drinking water, and then studied the association between various levels of arsenic in drinking water and the risk of skin lesions 7.
In case control studies, the odds of exposure are compared for cases and controls. Therefore, the effect measure is referred to as the Odds Ratio. Refer to the following table. This table presents a fictitious example of a case control study. In this case control study, 100 cases and 100 controls were asked about their exposure to “Exposure” and the investigators ended up with a table as follows:
Table 1. A fictitious example of a case control study

Exposure | Cases | Controls
Exposed | 70 | 30
Non-exposed | 30 | 70
Total | 100 | 100

As can be seen from this table, 70 out of 100 cases and 30 out of 100 controls tested positive for exposure. Thus, the odds of exposure among the cases were 70:30, and the odds of exposure among the controls were 30:70. Hence the Odds Ratio is (70 * 70) / (30 * 30) = 4900/900 = 49/9, or approximately 5.4. If we replace the figures in the above table with A, B, C, and D, the table looks like this:

Table 2. General scheme for case control studies

Exposure | Cases | Controls
Exposed | A | B
Non-exposed | C | D
Total | A+C | B+D

The Odds Ratio would then be estimated as OR = (A * D) / (B * C). This is also known as “cross product ratio” for finding out the odds ratio from one study or for one set of findings.
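The cross product calculation can be reproduced in R with the fictitious counts from Table 1. The confidence interval at the end is an addition that is not part of the text above; it uses the common logit (Woolf) approximation and is shown only as a sketch.

```r
# Fictitious case control table from Table 1
tab <- matrix(c(70, 30,
                30, 70),
              nrow = 2, byrow = TRUE,
              dimnames = list(Exposure = c("Exposed", "Non-exposed"),
                              Status   = c("Cases", "Controls")))

# Odds of exposure among cases and among controls
odds_cases    <- tab["Exposed", "Cases"] / tab["Non-exposed", "Cases"]
odds_controls <- tab["Exposed", "Controls"] / tab["Non-exposed", "Controls"]
or_ratio      <- odds_cases / odds_controls

# The same result as a cross product ratio, (A * D) / (B * C)
or_cross <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])

# Approximate 95% confidence interval on the log scale
# (logit method; an assumption added for illustration)
se_log_or <- sqrt(sum(1 / tab))
ci <- exp(log(or_cross) + c(-1.96, 1.96) * se_log_or)

print(or_cross)
print(ci)
```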
In a case control study, it is possible to control for the effects of potential confounding variables. This can be achieved in three ways: matching, stratified analysis, and multivariate analysis.
Cases and controls can be matched on variables that are thought to be potential confounders. For example, Haque et al (1999) in their case control study matched their cases and controls on age (within five years) and gender.

Stratified analyses

To illustrate this, two tables are set up as follows, one for men and one for women in the above fictitious study we used for the case control study illustration.
Table 3. The exposure disease association table for men

Exposure | Cases | Controls
Exposed | 50 | 20
Non-exposed | 10 | 40
Total | 60 | 60

Table 4. The exposure disease association table for women

Exposure | Cases | Controls
Exposed | 20 | 10
Non-exposed | 20 | 30
Total | 40 | 40

In this fictitious example, the Odds Ratio for men was 10.0 (50 * 40 / (20 * 10)), while that for women was 3.0 (20 * 30 / (10 * 20)). In both groups there was an association between exposure and case control status, and in the same direction, but the magnitudes were very different. When something like this happens, it provides an indication that confounding by that variable has occurred; thus, in this case control study, you may conclude that confounding by gender has occurred. The crude OR of 5.4 obtained above describes the association in general, without adjusting for gender, but it is not accurate because it does not control for the fact that the association differs between men and women. Hence, the adjustment is done as follows:
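The strata are pooled using the Mantel-Haenszel method, in which the cross products of each stratum are divided by the total number of participants in that stratum (120 men and 80 women) before they are summed:
(50 * 40 / 120 + 20 * 30 / 80) / (20 * 10 / 120 + 20 * 10 / 80) = 24.17 / 4.17 = 5.8, or the pooled Odds Ratio was 5.8. Note that this Odds Ratio is between 10 and 3 and is more than, but not too far away from, the crude OR of 5.4. As before, the algebraic equation for this situation is as follows:
The Pooled Odds Ratio = OR(mh) = (A1 * D1 / N1 + A2 * D2 / N2) / (B1 * C1 / N1 + B2 * C2 / N2), where N1 and N2 are the total numbers of participants in the first and second stratum.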
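A minimal Python sketch of the stratified calculation, using the counts in Tables 3 and 4. It computes the stratum-specific Odds Ratios and a pooled estimate with the standard Mantel-Haenszel weighting, in which each stratum's cross products are divided by that stratum's total before summing; the ratio of the unweighted sums of cross products is shown alongside for comparison. The function names are illustrative assumptions, not part of any library.

```python
def odds_ratio(a, b, c, d):
    """Cross product ratio for one 2x2 table (A, B, C, D as in Table 2)."""
    return (a * d) / (b * c)


def mantel_haenszel_or(strata):
    """Pooled odds ratio over a list of (A, B, C, D) tables.

    Each stratum contributes A*D/N to the numerator and B*C/N to the
    denominator, where N is the total number of people in that stratum.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator


men = (50, 20, 10, 40)    # Table 3
women = (20, 10, 20, 30)  # Table 4

or_men = odds_ratio(*men)                       # 10.0
or_women = odds_ratio(*women)                   # 3.0
crude_or = odds_ratio(70, 30, 30, 70)           # about 5.44
unweighted = (50 * 40 + 20 * 30) / (20 * 10 + 20 * 10)  # 6.5
pooled_or = mantel_haenszel_or([men, women])    # about 5.8

print(or_men, or_women, round(crude_or, 2), unweighted, round(pooled_or, 2))
```

The pooled estimate lies between the two stratum-specific values, which is what one expects from a weighted average of the strata.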

Multivariate Analysis

Logistic regression is the analysis of choice in case control studies. A detailed description of the theory and practice of logistic regression is beyond the scope here, so a brief description of the principles is given. In logistic regression analysis, the logit function of an outcome is modelled on the variables. A logit function is essentially a logarithm of the odds of an event. For example, let’s say we found out of 100 cases, 70 individuals had the exposure that we wanted to study. Expressed in logit, it would be log(70/30). Usually, natural logarithms are used for this analysis. The logit function is then regressed in a linear model on the exposure and confounding variables. The simplest model looks like so:
Logit(Y) = alpha + beta * X
Where alpha is the intercept and beta is the regression coefficient for the exposure variable X. X can be a binary variable (taking the value of 0 or 1), a continuous variable, or an ordinal variable (more on the variables in the data analysis section).
Let’s say X is a binary variable that takes a value of 1 or 0, where 1 = exposure to the environmental agent and 0 = non-exposure to the environmental agent. Then, according to this equation, when X is set to 1,
Logit Y for exposure = alpha + beta(X = 1) = alpha + beta … (1)
When X is set to 0 (that is non-exposure), then,
Logit Y for non-exposure = alpha + beta(X = 0) = alpha + 0 … (2)
If we deduct (2) from (1), we have,
Logit Y for exposure – Logit Y for nonexposure = beta … (3)
As a logit is a logarithm, and as subtracting one logarithm from another gives the logarithm of their ratio, equation (3) is actually
Log (Odds of Y for exposure / Odds of Y for non-exposure) = log(Odds Ratio) = beta … (4)
Therefore Odds Ratio = exponential (beta) … (5); that is, we raise the constant e (approximately 2.718) to the power beta.
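Equations (1) to (5) can be traced numerically. The sketch below is an illustration under an assumption: it reads Table 1 row-wise (70 cases and 30 controls among the exposed, 30 cases and 70 controls among the non-exposed) and uses the fact that, with a single binary X, the fitted logits equal the observed log odds in each exposure group. No model fitting library is used, and the helper name `log_odds` is hypothetical.

```python
import math


def log_odds(events, non_events):
    """Natural log of the odds, that is, the logit."""
    return math.log(events / non_events)


# Table 1, read by exposure group.
logit_exposed = log_odds(70, 30)      # alpha + beta, equation (1)
logit_unexposed = log_odds(30, 70)    # alpha,        equation (2)

alpha = logit_unexposed
beta = logit_exposed - logit_unexposed  # equation (3)
odds_ratio = math.exp(beta)             # equation (5)

print(round(alpha, 3), round(beta, 3), round(odds_ratio, 2))
```

Exponentiating beta recovers the same value as the cross product ratio, 49/9.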
Note that because this is a linear model, we can add many variables to the equation; the result is a multivariate logistic regression. This will be explored in the data analysis section. For example, Haque et al (2003) conducted a multivariate logistic regression to test the association between arsenic exposure and case control status, the cases being those with skin lesions and the controls being those who did not have skin lesions.
Two known disadvantages of case control studies are that they are retrospective and that they are subject to recall bias. In a case control study, because participants are identified on the basis of their outcomes, exposure data are collected retrospectively; as a result, this study design cannot control for time, that is, it cannot be ensured that the exposure preceded the outcome. That said, in case control study designs it is possible to study multiple exposure variables for the same outcome measure. It is also possible to study rare diseases, as the sampling of individuals is done on the basis of their outcomes. Rather than waiting for the outcomes to occur for a common set of exposures, it is possible to start with an outcome, sample individuals on that basis, and study the occurrence of exposure among cases and controls.

Retrospective Cohort Study

In a cohort study, cohorts are assembled and then followed up in time. Cohorts are similar groups of individuals, in this case those who are and those who are not exposed to an exposure variable of interest. This can be done retrospectively using historical data, or prospectively. When the cohorts are assembled in historic time and are also “followed up” in historic time (that is, in time that has preceded the conduct of the research), this type of cohort study is referred to as a “retrospective cohort study”. Retrospective cohort studies are frequently conducted in workplace settings and are particularly useful in occupational epidemiology. For example, in 1980, Bengt Sjögren and colleagues reported the results of a study on welders who welded stainless steel and were therefore exposed to hexavalent chromium. As hexavalent chromium was a known cancer causing agent in animal studies, Sjögren and colleagues wanted to study what would happen to workers who were occupationally exposed to high concentrations of chromium. For this, they obtained data on welders in Sweden who were exposed between 1950 and 1965 and followed their health records till 1977 (8). They found that while the standardised mortality ratios for other cancers were similar for the welders and the general public, welders had a higher risk of lung cancer.

Prospective Cohort Study

In a prospective cohort study, cohorts of participants are assembled before the commencement of the study, on the basis of whether they are exposed or not exposed to the environmental factors of interest. At the beginning of the study, the members of the cohorts must also be free of the outcome of interest. For instance, imagine that a cohort study is being conducted to test the theory that workplace induced noise leads to hypertension in employees. The hypothesis being tested is that exposure to noise on a particular factory shop floor leads to the development of hypertension after five years of working there, compared with non-exposure to noise while working in the same factory. To test this hypothesis, employees can be assembled into two cohorts, both free from hypertension to start with: the members of one cohort are exposed to constant ambient noise in their place of work, and the members of the other cohort are selected from office desk jobs in the same factory, removed from the shop floor noise. After this, the cohorts are periodically examined for signs of developing hypertension and are compared. The effect measure for a cohort study is the Relative Risk estimate, where incidence rates of the disease are compared for the exposed and the non-exposed. Cohort studies, specifically prospective cohort studies, are advantageous in the sense that a number of different diseases arising from the same source of exposure can be studied. For instance, exposure to noise as a stressor can lead to other stress related diseases such as diabetes, premature balding or depression. A second advantage of the prospective cohort study is the ability to nest case control studies within it, which is described next. But cohort studies are also expensive and time consuming.
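As a rough illustration of the Relative Risk, the sketch below uses entirely hypothetical counts for the noise and hypertension example (the text gives no figures): 200 shop floor workers of whom 40 develop hypertension over five years, and 200 office workers of whom 20 do. It assumes every cohort member is followed for the full period, so that the incidence in each cohort is simply the proportion who develop the disease.

```python
def relative_risk(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """Ratio of the incidence in the exposed to that in the non-exposed cohort."""
    risk_exposed = cases_exposed / n_exposed
    risk_unexposed = cases_unexposed / n_unexposed
    return risk_exposed / risk_unexposed


# Hypothetical five year follow-up counts (not from any real study).
rr = relative_risk(cases_exposed=40, n_exposed=200,
                   cases_unexposed=20, n_unexposed=200)
print(round(rr, 2))  # 2.0
```

With these made-up numbers, hypertension would be twice as frequent in the noise exposed cohort as in the office cohort.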

Nesting a case control study

A nested case control study is “embedded” within a prospective cohort study. At the beginning of the main study, potentially useful exposure data such as blood samples are collected from every member of each cohort. Then, after a certain period of time, when enough individuals with specific health outcomes of interest have accumulated, a case control study is conducted, based on the exposure data collected at the beginning of the study. This study design overcomes the disadvantage of recall bias that can occur in a regular case control study.


This was a brief introduction to the main principles of study designs in environmental health. Epidemiological study designs are very important for establishing an association between an environmental agent and a health outcome.
In the arsenic toxicity studies, initial observations led to the discovery that people who lived in the Gangetic delta, in both India and Bangladesh, were exposed to high arsenic concentrations in their drinking water. This drinking water came from shallow tube wells that were dug to obtain ground water for irrigation but were also used for drinking. This led epidemiologists, geologists and environmental health experts to identify hundreds of millions of people who were exposed to high concentrations of inorganic arsenic in their drinking water.


  1. Wilson, J. G., Kingham, S., & Sturman, A. P. (2006). Intraurban variations of PM10 air pollution in Christchurch, New Zealand: Implications for epidemiological studies. Science of the Total Environment, 367(2-3), 559-572. doi:10.1016/j.scitotenv.2005.08.045
  2. Sadiva, P. S., & Dockery, D. D. (n.d.). Air pollution and mortality in elderly people: A time series study in São Paulo, Brazil. Archives of Environmental Health, 50(2), 159-163.
  3. Nitschke, M., Tucker, G. R., Hansen, A. L., Williams, S., Zhang, Y., & Bi, P. (2011). Impact of two recent extreme heat episodes on morbidity and mortality in Adelaide, South Australia: A case-series analysis. Environmental Health, 10(1), 42. doi:10.1186/1476-069x-10-42
  4. Whitaker, H. W., Farington, C. P. F., Spissens, B. S., & Musonda, P. M. (2005). The self-controlled case series method. Statistics in Medicine, 0, 1-31.
  5. Sadiva, P. S., & Dockery, D. D. (n.d.). Air pollution and mortality in elderly people: A time series study in São Paulo, Brazil. Archives of Environmental Health, 50(2), 159-163.
  6. Guha Mazumder, D. N. G. M. (1998). Arsenic levels in drinking water and the prevalence of skin lesions in West Bengal. International Journal of Epidemiology, 27, 871-877.
  7. Haque, R. H., Guha Mazumder, D. N. G. M., & Smith, A. S. (2003). Arsenic in drinking water and skin lesions: Dose-response data from West Bengal, India. Epidemiology (Cambridge, Mass.), 14, 174-182.
  8. Sjögren, B. (1980). A retrospective cohort study of mortality among stainless steel welders. Scandinavian Journal of Work, Environment & Health, 6(3), 197-200. doi:10.5271/sjweh.261

Assessing Health Effects

Health Effects Assessment

Environmental Health Effects measurement


As the Indian Airlines flight landed in the Birsa Munda airport in Ranchi, India, we tightened our seat belts and looked out of the tiny aircraft window. The Roro hills rose in the horizon as the aircraft turned towards the tarmac. Our journey began.
We climbed into the car that Peter sent for us and were greeted by the lush greenery around us and by the hill where company X had mined asbestos for years and then abandoned the hills and the little township. The following morning, the five of us were joined by a journalist, Mimi, and we went up the hill along a little dirt track. The climb was steep, but it was not hard. The morning sun shone overhead, and we were sweating by midday when we reached the top. Around the mine, a few people greeted us. A few short, slender men were hanging about, and a few women were searching among what looked like piles of blue asbestos rock. Children were playing. A lot of dust was generated and people did not seem to care.
“Study the people, Arin,” Peter said to me, pointing to the men, women and children. “What would you say? These were all people who used to work the mine. The company closed the mine and moved on. Except they did not bother to close the tailing ponds properly.” Peter guided our group to a few of these tailing ponds. It was obvious that the company had not bothered to close them properly and that the people here were exposed to high concentrations of asbestos, both in the air and from the piles they were playing with. Something needed to be done, urgently.
In the afternoon, we came down from the hills and organised a camp where we examined the local people from the area. They mostly complained of breathing difficulties, some had severe backaches, and what you’d consider pretty regular stuff, routine complaints. As we examined the people and listened to their stories, it turned out that nearly all of them suffered from some form of respiratory problem, and people who had worked in the mine for years also had severe backache and severe bodily pains. They were also hungry and tired.
Later in the afternoon, we met with the local doctor and the chief medical officer of the Roro hills area. It was astonishing to see the hand waving and wishing away that these men were showing. Between our cups of chai, the doctor assured us that these were all poor people and that what we were seeing was just good old tuberculosis. Nothing to do with asbestos mining.
Then out popped an X-ray plate with an almost coin shaped opaque shadow across the chest and streaks of white strands in the lung fields. Mimi was insistent about what these could be. The doctor smiled, and said something like “could be asbestosis”.
Could be? There was silence. Someone discovered blood stains in the beds of patients they saw in their rounds in the hospital that morning. A story of mass exposure to asbestos was unfolding.

Conceptual Introduction of Measuring Health Effects

All environmental exposures lead to one or another health effect; some of these are undetected or subtle, while others have manifestations ranging from minor discomfort to death. Health effects in individuals and populations can be measured in a number of different ways. At the individual level, these could be symptoms and signs. Symptoms refer to “complaints” that individuals verbally express to their health caregivers (nurses, doctors, other caregivers). The term “sign” refers to the findings that a doctor discovers upon examining a patient. The signs and symptoms taken together lead a clinician to order more tests and additional investigations to confirm a diagnosis before treatment can be prescribed. At the level of populations, these different signs and symptoms are considered together in the form of diagnoses, and the diagnoses are tallied. Occasionally, the disease conditions or biological or physiological parameters are summarised to indicate the state of health of entire populations. When exposure to an agent in the environment results in physical harm, the process is known as toxicity. An exposure can therefore be a toxic exposure if it is capable of producing physical illnesses that result in significant disability or even death in a number of individuals.


Toxicity refers to the capacity of an environmental variable to cause harm (EnvironmentalEpidemiologyA 1999). Toxic agents target either a tissue or a specific set of target molecules; toxicity in turn can develop over a short period of time (acute toxicity), or it can take time to develop (chronic toxicity). For example, Jonathan Klick and Joshua Wright (2012) wrote a concept paper in which they claimed that reusable cotton bags, often used for shopping in lieu of plastic bags, harbour E. coli bacteria, and that these grow if the bags are kept in the trunks of cars (Klick 2012). In turn, they argued that it would be more cost effective if shoppers continued to use plastic bags rather than cloth bags. But note the argument here: the cloth bags are used to carry food items; as a result, E. coli contaminate the food, and when these foods are consumed, either cooked or raw, E. coli from the food enter the body, where they result in clinical manifestations of diarrhoea, vomiting, and fever. These events occur relatively rapidly, over days, and hence this toxicity is termed acute toxicity. On the other hand, people may be exposed to high concentrations of air pollutants for years, and some of them manifest signs of respiratory difficulties only after years of exposure; this is chronic toxicity. While acute or chronic toxicity refers to the speed with which adverse health effects develop over time, it is also important to appreciate that effects can span quite a range of manifestations (the spectrum of effects), which we study now.

Spectrum of Effects

At one end of the range of effects (spectrum) is “no effect”, that is, the person is either healthy or complains of minor discomfort; at the other extreme, an individual can die. For example, on days of high pollution, some people complain of irritation in their eyes and nostrils. While these cause significant discomfort for these individuals, none of these health effects is life threatening. Some changes are subclinical: subtle mechanisms are set in motion, and these may be the first signs of change even though clinical manifestations may not occur at all. Understanding subclinical changes is nevertheless important, as it enables action to be taken before clinical manifestations appear and can lead to better prevention.

On the other hand, with acute toxicity, think of what happened in Bhopal on the night of the third of December, 1984. That night, 3,800 people who lived near the Union Carbide plant in Bhopal, India, died immediately as they were suddenly exposed to very high concentrations of methyl isocyanate gas that leaked out of the plant and entered their bodies (Broughton 2005).

Something similar happens when people are exposed to high concentrations of carbon monoxide gas from burners, incomplete combustion and retrograde flow of these gases (Prockop 2007). Between these two extremes of discomfort and death, other health effects or health conditions make up the spectrum; in other circumstances, people may suffer from loss of function. For example, Sobngwi and colleagues (2004) suggest from their research in Africa that the urban environment, migration to urban environments, and environmental stressors can be implicated in the emergence of hypertension (high blood pressure) and diabetes (Sobngwi 2004). The point here is that, when exposed to environmental stressors, some people might develop diabetes, yet others may not develop any disease at all. This suggests that there is also a range of reactions among individuals who are exposed to the same levels of environmental stressors. Why could this be? One possible explanation is that, as people are genetically diverse, the differences could be related to how their genes and environmental variables interact to produce the health effect or to react to the environmental stimuli.

Relevance of Genetics to Environmental health

Genes are complex to define. At a conceptual level, think of genes as units of heritability that physically consist of “strings” of triplets (each triplet is a combination of three out of four nucleotides: Adenine, Guanine, Cytosine, and Thymine). Functionally, genes code for proteins that perform different functions in the body (resistance, carriage of molecules, conducting biochemical reactions). Each of us has about 20,000 to 25,000 protein-coding genes distributed across 23 pairs of chromosomes (22 pairs are known as autosomes and one pair is the sex chromosomes).

Our environmental conditions have an impact by changing genetic structures and functions in different ways. Some of these changes go unnoticed as no major changes occur. Others manifest in the form of genetic disorders. Some genetic disorders are due to the arrangement of the chromosomes themselves. For example, Down’s syndrome is a disorder characterised by specific physical features and mental health issues. Here the problem is trisomy of chromosome 21; instead of the normal two copies of chromosome 21, we get to see three copies. Genes also undergo changes in the form of mutations or single nucleotide polymorphisms (SNPs). In a single nucleotide polymorphism, one of the four possible nucleotides (Adenine, Guanine, Thymine, and Cytosine) at a particular locus on the chromosome or gene is replaced by another. A change can also be, for example, the insertion of an additional nucleotide or the loss of a nucleotide. These in turn lead to a different configuration of the gene and thus of the gene product. The change is labelled an SNP where its prevalence is more than one percent in the population; if the prevalence is less than one percent, the phenomenon is known as a mutation (Wild 2005).
Numerous environmental mutagens have been identified. For example, Ahsan et al (2007) reported that people exposed to inorganic arsenic in drinking water often show single nucleotide polymorphisms (Ahsan 2007). People who live in radiation hazard sites are at high risk of mutation and therefore manifest different types of genetic disorders. For example, in the state of Jharkhand in India, in a place known as Jadugoda, uranium was mined for years. Exposure to radioactive minerals from the uranium mines has resulted in significant health effects for children and other people who live in the area.
In addition to direct impact on genetic structure and therefore alteration of gene functions, epigenetic changes also occur as a result of exposure to environmental agents. In epigenetic changes, heritable changes occur even without any alterations in DNA sequence (Baccarelli 2009). In general, three common mechanisms are proposed: these include changes in histone structure of the chromosomes, or methylation of DNA, or formation of micro RNAs.

Acute versus Chronic Effects (latency)

We discussed in the above sections that health effects occur as a result of being exposed to specific environmental agents. We discussed that such effects can be toxic and these toxic effects have a range of effects and this range in turn can be explained on the basis of the environmental effects on the genetic structure and functions. There are also issues around health effects occurring over time. In this conceptualisation, health effects can be acute versus chronic. Acute effects are those that occur over a short period of time. For example, consider the case of people who are exposed to sudden high concentrations of carbon monoxide from incomplete combustion (trapped on a snowy night in a car and could not get out); unfortunately they die within a short span of time being exposed. As another comparable example, people who were exposed to heavy concentrations of methyl isocyanate from the Bhopal plant, died almost instantly after being exposed to such high concentrations of the gas. Such health effects are known as acute effects.

On the other hand, some health effects take a long time to develop. For example, among those who are exposed to inorganic arsenic through their drinking water supply, the initial manifestation of the health effect in the form of skin lesions often takes years, typically anywhere between five and ten years (Smith 2000). As another example, Bianchi et al (1997) estimated the latency period for 300 individuals in various trades exposed to asbestos and found that the latency period for the appearance of mesothelioma can range from 14 to 72 years (Bianchi 1997).

For most environmental toxins, the onset of disease and its manifestation take time. This interval can be partitioned into two phases. The initial phase is referred to as the induction phase: from the time of exposure, a time gap is observed until the biological process begins, which can be manifested in altered metabolism, for instance. At this point the disease has not yet manifested clinically; this stage is referred to as the subclinical phase of the disease. Then, from the first point of subclinical manifestation until the onset of symptoms and signs, the time interval is referred to as the latent period (Figure 1). For some diseases and for some exposures, both the subclinical phase and the latent period are short; influenza is an example. On the other hand, cancer causing agents typically have a long latent period.

Specific versus non-specific effects

For some health effects, it is not difficult to identify the environmental exposure that preceded them. For example, in people exposed to high concentrations of arsenic (either in drinking water for long periods or from other sources such as occupational exposure in copper smelting), skin pigmentation appears after a long period of time, but the manifestations themselves are quite specific: they include very characteristic dark and brown pigmentation on the skin of the back. Other health effects are generic. For example, people who are exposed to asbestos develop symptoms and signs of respiratory illness that resemble tuberculosis, and they are often misdiagnosed with tuberculosis or silicotuberculosis.


The term susceptibility refers to the proneness of an individual to manifest a particular disease or health effect, to a greater or lesser degree than other individuals who are exposed to the same environmental variable. For example, it is known that the malarial parasite Plasmodium falciparum causes a particularly virulent form of malaria. Yet not all individuals exposed to the same parasite under the same environmental conditions will develop falciparum malaria. Individuals with specific blood groups (Bombay blood group) and serotypes will not develop the disease even when exposed to P. falciparum. Another example is gluten enteropathy. Conleth Feighery argues that much about coeliac disease is still unclear. People who are susceptible to gluten will develop a very specific type of intolerance to it and show symptoms and signs of enteropathy (Feighery 1999). Hyperreactivity and hypersensitivity are two forms of susceptibility patterns, which we discuss next.

Hyperreactivity versus Hypersensitivity

Hyperreactivity and hypersensitivity refer to the situation where people who are exposed to specific agents develop severe health effects and responses, characterised by increased flow of blood and often by features of shock, such as loss of blood pressure and sudden collapse. For example, when exposed to specific allergens in food or in the environment, people develop rashes, skin inflammation, and inflammatory conditions of the bronchial tubes (asthma). The first time individuals encounter an allergen, the reactions may be minor, but once sensitised, when they are exposed a second time, they may show a very different and severe form of response, and quite often these reactions are life threatening. Anaphylactic shock is a condition that occurs due to extreme hypersensitivity to specific environmental agents such as drugs, pollens or specific animal products.

Thus far, we have discussed some issues around environmental exposure and the resulting diseases: how they develop, the time span over which they develop (acute versus chronic), and the variability of their manifestations (severe versus less severe). We have also looked at the susceptibility of individuals and populations. Let’s now review how we can measure health effects.

Case Definition

In health effects assessment, a precise case definition is the first principle. The term “case definition” refers to the process of clearly setting criteria for the selection or exclusion of a disease process. A precise case definition should allow one to identify a case of disease when the specified signs and symptoms are observed, or to rule out diseases, health conditions or health states. During outbreak investigations, this is the first step that epidemiologists take.

How Might We Measure Health Effects?

Several issues need to be considered when we embark on answering this rather broad question. We review data sources and data linkages for both primary and secondary data: existing databases, and primary data collected from individuals and from samples drawn from larger populations.

Data sources (primary versus secondary data)

A major source of population based health effects data is data collected and stored by others, such as large scale health surveys conducted by the Ministry of Health, the World Health Organisation or other organisations, or data reported in the literature. As these data are not collected by the researchers themselves, and the researchers rely on other sources to obtain them, they are referred to as secondary data. Primary data, on the other hand, refer to data obtained first hand (“primary”) by the researchers, through studies such as cross sectional surveys or case control studies on specific health related states. In the following paragraphs, we discuss several different sources of secondary data.

Data linkages

The term “data linkage” refers to the condition where individual data sets are connected with each other using a range of different identifiers and connectors. For instance, let’s say we have obtained a data set on climate variables and we also have access to another data set on hospital admissions from the same region or city. Similarly, in large scale air pollution studies that link ambient air quality with different health parameters, data on ambient air quality are collected from individual monitoring stations for a particular city over a period of time. Then, for the same city, researchers collect data on hospital admissions (for example, due to asthma or cardiovascular illnesses) from different hospitals. The two datasets are then joined or linked on the basis of a common identifier.
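The linkage step can be sketched in plain Python. Everything in the example is hypothetical: the city name, dates, PM10 readings, admission counts and field names are invented for illustration, and the common identifier is assumed to be the pair (city, date). Only records present in both data sets are kept.

```python
# Hypothetical daily air quality records from monitoring stations.
air_quality = [
    {"city": "City A", "date": "2020-01-01", "pm10": 35.0},
    {"city": "City A", "date": "2020-01-02", "pm10": 80.0},
    {"city": "City A", "date": "2020-01-03", "pm10": 52.5},
]

# Hypothetical daily hospital admission counts for the same city.
admissions = [
    {"city": "City A", "date": "2020-01-01", "asthma_admissions": 4},
    {"city": "City A", "date": "2020-01-02", "asthma_admissions": 9},
]


def link(left, right, keys):
    """Join two lists of records on the given identifier fields."""
    # Index the right-hand records by their identifier for fast lookup.
    index = {tuple(row[k] for k in keys): row for row in right}
    linked = []
    for row in left:
        match = index.get(tuple(row[k] for k in keys))
        if match is not None:  # keep only records found in both data sets
            linked.append({**row, **match})
    return linked


linked_records = link(air_quality, admissions, keys=("city", "date"))
for record in linked_records:
    print(record["date"], record["pm10"], record["asthma_admissions"])
```

The third air quality record has no matching admissions record, so it is dropped; whether unmatched records should instead be kept with missing values is a design choice that depends on the analysis.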

Sources of Mortality and Morbidity Data

Mortality Data

The death or mortality data are commonly sourced from death registers. These might include hospital or morgue records. In addition, in many countries the State collects such data periodically and makes them available to investigators, researchers and health professionals. The common measures are death rates and age-specific death rates.

Morbidity Data

Data on human illnesses or health related states (morbidity data) are obtained from ministry of health databases, from individual hospitals, workplaces and schools, or indeed from other settings where officials keep records of people’s health status as they interact with businesses or go about their activities. For hospitals, this would mean patients who attend the inpatient or outpatient departments. In small clinics, this might mean all those who attend the various clinical facilities. In the context of workplaces, pre-employment checks are made even before the employees enter the job. Then, when the employees start their jobs, another round of health examinations is made and the data are maintained. Health checks are also conducted during the various phases of employment. Thus workplace records may provide a rich source from which health data can be abstracted and studied. Similarly, records of school based physical examinations and health checks can be accessed and studied. In addition to these commonly available sources, disease registries such as congenital disease registries and cancer registries are common sources of data for health care data analysis. The common measures used for morbidity are prevalence, incidence, and age-adjusted and age-standardised rates of incidence and prevalence.

Hospital Records

Hospital records are created when a patient first reports to the hospital or is referred to the hospital from some other source. During these clinical encounters, hospital clerks or clinical personnel enter data on the patient’s age, gender, socioeconomic status, and eligibility for or availability of insurance. On the clinical side, a detailed history, records of physical examination, and instrument based data (that is, X-rays, ultrasonograms, or other means) are taken, and the records are then maintained.

Employment Data

Before joining work, at many workplaces, employees are routinely examined for their physical fitness or checked for the presence of specific illnesses that are relevant to the job they are joining. For example, physical fitness is routinely tested for people in the police force. These data frequently include respiratory flow, vital capacity and other tests. Many workplaces also test employees’ blood sugar levels, check for high blood pressure (hence they measure blood pressure periodically), and test vision as well. For many people who work for software companies, where job stress may lead to complaints such as carpal tunnel syndrome or other orthopaedic disorders, it is common to undergo orthopaedic check-ups. These data can be accessed by, or made available to, the employees or researchers upon request. These data may also be made available in the form of anonymised records.

Disease registries

Disease registries are common sources of health or morbidity data. Many countries keep registries for diseases that are either rare or of special interest. For example, many countries maintain birth defect or congenital disease registries. Most countries in the world maintain cancer registries. Internationally, the International Agency for Research on Cancer (IARC), the cancer agency of the World Health Organisation, coordinates and shares data among different cancer registries. In a cancer registry, health providers are given forms on which they enter details about the cancer diagnosis of a patient. Cancer registries also conduct health surveys or cross sectional surveys from time to time.

Health Interview Surveys

Many countries and international agencies periodically conduct cross sectional surveys based on health interviews. In these surveys, a representative sample of individuals from the population is surveyed for specific health states. Data from the health surveys can be combined in different ways with various other data sources.

How to draft questionnaires for health effects data collection

Questionnaires are important means to access and collect health related data. Some principles of constructing questionnaires are as follows:

  1. Avoid overly direct or leading questions: instead of asking “Are you a smoker?”, ask “How many cigarettes per day do you smoke?” and then provide a series of choices.
  2. Collect demographic and income level data at the end of the survey, as people do not like to disclose them at the outset and seeking this information early may put off some individuals.
  3. Keep the questions simple, clear, and easily understandable.
  4. Frame the questions with multiple choice answers, and write the choices so that they tap the opinion of the person answering the survey.

All survey questionnaires need to be validated before being used in the field. There are three forms of validation: face, content, and construct validation. The first and the easiest form is face validation. In face validation, the questionnaire is presented to lay people and their opinions on the readability and presentation of information are sought. If the different users agree that the questionnaire is easy to fill in, intuitive, and expresses clearly what is needed, then the questionnaire has passed the face validity test. Content validity is established by giving the questionnaire to experts, who rate it on whether its items tap all the important concepts that it sets out to measure, whether it covers the most important content areas, and whether it is missing any content that must be included given the purpose of the questionnaire or the data collection exercise. The extent to which the experts agree on this aspect of the questionnaire is determined using an agreement statistic. Finally, using a preliminary survey, the constructs or concepts on which the questionnaire is based are identified. This process is referred to as establishing the construct validity of the questionnaire. When face, content, and construct validity are established, the questionnaire is ready to be deployed in the field, as it has been formally shown to be internally valid.
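
To make the idea of an agreement statistic concrete, here is a small Python sketch of Cohen’s kappa for two experts. The text does not say which statistic to use, so choosing kappa is an assumption, and the ratings below are hypothetical. Kappa compares the agreement actually observed with the agreement expected if both experts rated at random according to their own marginal frequencies.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who rated the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must rate the same, non-empty set of items")
    n = len(ratings_a)
    # Observed agreement: share of items given the same rating by both raters.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement from each rater's marginal frequencies.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical category throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical ratings: is each questionnaire item relevant to the concept?
expert_1 = ["relevant", "relevant", "not relevant", "not relevant"]
expert_2 = ["relevant", "not relevant", "not relevant", "not relevant"]
print(cohens_kappa(expert_1, expert_2))
```

A kappa near 1 suggests the experts agree well beyond chance, while a value near 0 suggests their agreement is about what chance alone would produce.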

Physical Examinations

For individual data collection, physical examinations are important. In clinical practice in particular, the physical examination immediately follows history taking. Here, a doctor actually examines a patient and takes notes. An example would be a doctor measuring a patient’s blood pressure using a sphygmomanometer and stethoscope. The results of the examination are noted on a physical examination report sheet or history sheet, and these are then made available or used for deciding the differential diagnosis of a patient.

Physiological Measures

Several different types of physiological measures are also taken to confirm the initial diagnoses arrived at for individual patients. For example, in planning care for people with diabetes, doctors measure glycosylated haemoglobin, that is, the fraction of haemoglobin to which glucose has bound non-enzymatically. Because this reflects the life span of red blood cells (roughly 120 days), it is used to test the extent of glucose control in patients with long standing diabetes, or to test adherence or response to treatment. Another example is the measurement of forced expiratory volume to test lung function.

Laboratory Testing

Laboratory measures are commonly done to establish diagnoses and assess prognoses of patients. Samples can be obtained from almost every system of the body and tested in the laboratory to estimate whether a person is suffering from one or another disease. For example, thyroid function tests (measurement of tri-iodothyronine (T3), thyroxine (T4), and thyroid stimulating hormone (TSH) levels) are commonly done to test the functional status of the thyroid gland. The HbA1c test is another example.

Biomarkers and Genetic Markers

The laboratory testing services can provide information about different biomarkers of exposure and disease conditions, and can provide important information about the levels of control following the establishment of treatment or the implementation of policy. Biomarker based assessments rest on the assumption that bodily functions can be assessed by measuring surrogate identifiers that are present in the blood, breath, or other body fluids in sufficient quantities and can provide a clue as to the functional status of the different organs and organ systems.

Genetic markers are used for identifying and testing the genetic basis of diseases or health conditions. For example, karyotyping is a process where chromosomes are mapped and laid out. This is done to identify chromosomal abnormalities. In other cases, segments of genes are spliced and amplified using procedures such as the polymerase chain reaction (PCR), and these segments are then tested for mutations and the presence of genetic variations. Such procedures are based on genetic markers of the specific diseases or health related states under study.

Methodological Issues in Health Effects Measurement

Validity and Reliability

The term “validity” refers to the concept that we should be able to measure what we set out to measure. For example, if HbA1c is used to measure the extent of blood sugar control in the intermediate term (about four months), then this is what it should measure. Likewise, if a test or survey instrument is aimed at measuring a person’s opinion or mental state, then the instrument should accurately measure that mental state and nothing else, based on the constructs that are chosen for that measure.

Reliability of an instrument or a procedure refers to the fact that if the process is repeated with the same instrument or procedure, on the same set of individuals, within a reasonable interval of time and under unchanged conditions, then the successive results will be very similar if not identical. For measuring a specific disease condition or mapping health states, the key requirement is that the measure must be both valid and reliable. Let’s see how this relates to two sources of variation we can expect.

Inter-individual variability

Inter-individual variability refers to the variation that is observed when two or more individuals are given the same test, that is, the extent to which their responses to the test differ.

Intra-individual variability

Intra-individual variability refers to the extent to which results vary when the same test is repeated on the same individual over a period of time. The extent to which the results vary may depend on a number of issues. It can be that the instrument itself was not very good and was perhaps ambiguous, so that the individual provided one set of answers at time point A, then understood the questionnaire differently and provided another set of responses at time point B. This results in the instrument returning two different sets of answers at two different times, although the conditions did not change. On the other hand, it is possible that the tests are repeated after a long time and that in the meanwhile the people on whom the tests are repeated have also changed. When this happens, the measurements are different for the same person over two different periods of time. This is known as secular change.

While designing a measurement tool, the tool developer needs to keep both these issues in mind. For the same individual with the same condition, and if the conditions do not change much, the measurements should not vary much over a short period. Similarly, for a sensitive tool, across a range of individuals with roughly the same conditions, the measurements returned must be similar and cannot vary too much.
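
The two sources of variation can be separated in a simple descriptive way when each person is measured more than once. The Python sketch below is an illustration under stated assumptions: the function name and the blood pressure readings are hypothetical, intra-individual variability is summarised as the average of each person’s own variance, and inter-individual variability as the variance of the person means. This is a crude split and not a formal variance components model.

```python
from statistics import mean, variance


def variability_components(readings_by_person):
    """Return (between_person_variance, within_person_variance).

    readings_by_person maps each person to a list of at least two repeated
    readings; at least two people are needed.
    """
    if len(readings_by_person) < 2:
        raise ValueError("need at least two people")
    if any(len(r) < 2 for r in readings_by_person.values()):
        raise ValueError("need at least two readings per person")
    # Inter-individual: how much the people's average readings differ.
    person_means = [mean(r) for r in readings_by_person.values()]
    between = variance(person_means)
    # Intra-individual: how much each person's own readings differ, averaged.
    within = mean([variance(r) for r in readings_by_person.values()])
    return between, within


# Hypothetical systolic blood pressure readings (mmHg), two visits each.
readings = {
    "person_a": [120, 122],
    "person_b": [140, 142],
    "person_c": [130, 128],
}
between, within = variability_components(readings)
print("between-person variance:", between)
print("within-person variance:", within)
```

If the within-person variance is large relative to the between-person variance over a short interval, that points to an unreliable instrument more than to real differences between people.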

How Environmental Health affects different organ systems

Different environmental factors impact human health and human organ systems in different ways.

Skin and Skin disease

Sunlight or UV rays impact skin pigmentation. UV rays are also responsible for skin cancers. Exposure to arsenic causes depigmentation. Agents that damage the liver (hepatotoxicity) result in yellow discolouration of the skin (jaundice). Structurally, the skin has several layers (epidermis, dermis, and hypodermal fat). The dermis contains the roots of the hair follicles and other epithelial tissues. The dermis also contains blood vessels and connective tissue. The epidermis contains squamous epithelial cells and melanocytes. Each of these tissues or tissue systems can be impacted by different environmental stimuli, and each is therefore affected differently. Arsenic, mercury, other heavy metals, UV radiation, fungal infections, and corrosive agents all affect the skin to different extents as a result of environmental exposure.

Respiratory system and respiratory health effects

The respiratory system starts with the nostrils or oral cavity and continues through the pharynx, larynx, trachea, bronchi, and bronchioles, ending in the alveoli. The alveoli are connected across the lung parenchyma and the surrounding vascular spaces. Toxins, for instance air pollutant particles and asbestos fibres, can cross the alveolar boundaries. The bronchial muscles can be hypersensitive to allergens and react to smoke. Exposure to cigarette smoke, industrial smoke, and industrial agents is responsible for different types of interstitial lung diseases.

Central and peripheral nervous systems and associated illnesses

The nervous systems include the central (brain and spinal cord), the peripheral (the sensory and motor nerves that branch out of the spinal cord), and the autonomic nervous systems (some cranial nerves, such as the vagus and accessory nerves, and the sympathetic nervous system). Several environmental agents impact the nervous system. For instance, mercury can impact the peripheral nerves; environmental agents that can cross the blood brain barrier can impact the brain and the central nervous system. Electromagnetic radiation from cell phones, for instance, has been studied for its possible effects on the central nervous system.

Gastrointestinal system and associated diseases

Environmental agents that affect the gastrointestinal system include biological agents such as E. coli and other bacteria and viruses. The liver is affected by exposure to industrial solvents. When industrial solvents come in contact with the skin, or when their fumes are inhaled, they reach the liver via the blood stream and can damage it (for example, hepatocellular cancer). Inorganic arsenic can be both inhaled and ingested; it then reaches the liver, where it is biotransformed into methylated arsenicals, which can damage DNA and lead to health effects. The health effects include skin lesions and cancers of different organs. Exposure to hepatitis viruses is an occupational risk for people who work with blood transfusion and for nurses and surgeons who need to deal with blood and body fluids.

Hematological disorders

Several environmental agents cause damage to haematopoietic tissues. Haematopoietic tissues are those that are involved in the regulation of blood formation: typically these tissues are in the bone marrow or in the kidney (erythropoietin generating tissue). High levels of radiation can damage the haematopoietic tissues. This leads to either severe anaemia or leukaemia. For example, it is known that benzene is a bone marrow suppressor, and thus chronic or long term exposure to benzene leads to chronic anaemia. Exposure to benzene usually occurs in laboratory technicians who work in pathology laboratories that handle benzene dyes (and also occupationally in other groups where benzene or toluene dyes are handled). Petrol pump attendants are also exposed to benzene from the fumes that escape from petrol tanks during filling.

Immunological disorders

Immunological disorders refer to those disorders where the immunological system of the body is at risk or compromised in action. For example, infection with agents such as HIV can lead to immunological disorders or immunosuppression.

Reproductive disorders

For men, any exposure that interferes with sperm production is important. For example, certain chemical agents have been found to be associated with reduced sperm counts. For women, exposure to extreme stress, to hormones in food, or to steroids can lead to reproductive problems and can interfere with ovulatory cycles.

Hormonal disorders

Occupational exposure to radioactive iodine can lead to thyroid cancer. In parts of the world where the iodine content of the soil is low, iodine deficiency goitre or hypothyroidism is common.

Environmental Cancer

Environmental cancer refers to those cancers for which an environmental agent can be implicated. For example, some studies have reported an association between exposure to high frequency microwave radiation and the risk of brain tumours. Exposure to ultraviolet radiation is associated with a high risk of development of skin cancer among people who work outdoors.

How to investigate environmental cancer

Cancers being rare diseases, case control studies are commonly used to study the associations between environmental exposure and cancers. In case control studies, people with and without cancers are selected. Those with cancers are referred to as cases and those without cancers are referred to as controls. Their likelihood of exposure is then estimated and measured through questionnaires and direct measurements. Other study designs include secondary data analyses and ecological studies. In ecological studies, exposure levels and cancer rates are each aggregated, and then the two aggregated measurements are regressed one on the other.
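
In a case control study the usual measure of association is the odds ratio, computed from a two by two table of exposure among cases and controls. The Python sketch below is a minimal illustration. The function name and the counts are hypothetical, and the 95% confidence interval uses the standard log-based (Woolf) approximation, which is an assumption that works poorly when any cell count is small.

```python
import math


def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Return (odds ratio, lower 95% limit, upper 95% limit)."""
    cells = (exposed_cases, unexposed_cases, exposed_controls, unexposed_controls)
    if any(c <= 0 for c in cells):
        raise ValueError("every cell of the table must be a positive count")
    # Odds of exposure among cases divided by odds of exposure among controls.
    ratio = (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)
    # Standard error of log(odds ratio) under the Woolf approximation.
    se_log = math.sqrt(sum(1 / c for c in cells))
    lower = math.exp(math.log(ratio) - 1.96 * se_log)
    upper = math.exp(math.log(ratio) + 1.96 * se_log)
    return ratio, lower, upper


# Hypothetical study: 100 cases and 100 controls.
ratio, lower, upper = odds_ratio(
    exposed_cases=40, unexposed_cases=60,
    exposed_controls=20, unexposed_controls=80,
)
print(f"OR = {ratio:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

An odds ratio above 1 whose confidence interval excludes 1 suggests that exposure is more common among cases than among controls, although it does not by itself rule out bias or confounding.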

Epilogue and summary

This was a brief tour of the different types of health effects and health effects measurement. We started with the concept of health effects measurement and the spectrum along which health effects fall. We discussed the variation in health effects that people demonstrate even when they are exposed to very similar levels of exposure. We discussed different types of study designs and ways to study health effects in populations.

In the end, the typical respiratory diseases that we found in Roro hills were due to people’s exposure to the asbestos in the mine. We conducted a cross sectional survey to describe the high prevalence of respiratory and low back pain problems among the people who lived there and had previously worked in the asbestos mine. Asbestos is a hazardous mineral fibre, and prolonged exposure to it is dangerous, as numerous studies have shown.


Ahsan, H., Chen, Y., Kibriya, M. G., Slavkovich, V., Parvez, F., Jasmine, F., … Graziano, J. H. (2007). Arsenic metabolism, genetic susceptibility, and risk of premalignant skin lesions in bangladesh. Cancer Epidemiology, Biomarkers & Prevention : A Publication of the American Association for Cancer Research, Cosponsored by the American Society of Preventive Oncology, 16(6), 1270–1278. doi:10.1158/1055–9965.epi–06–0676

Baccarelli, A., & Bollati, V. (2009). Epigenetics and environmental chemicals. Current Opinion in Pediatrics, 21(2), 243. Retrieved from Google Scholar.

Bianchi, C. 1., Giarelli, L., Grandi, G., Brollo, A., Ramani, L., & Zuch, C. (1997). Latency periods in asbestos-related mesothelioma of the pleura. European Journal of Cancer Prevention, 6(2), 162–166. Retrieved from Google Scholar.

Broughton, E. (2005). The bhopal disaster and its aftermath: A review. Environmental Health: A Global Access Science Source, 4, 6. Retrieved from Google Scholar.

Environmental Epidemiology : A Textbook on Study Methods and Public Health Applications. (1999). Environmental epidemiology : A textbook on study methods and public health applications. [Genève]: World health organization. Retrieved from WorldCat.

Feighery, C. (1999). Fortnightly review: Coeliac disease. BMJ: British Medical Journal, 319(7204), 236. Retrieved from Google Scholar.

Klick, J., & Wright, J. D. (2012). Grocery bag bans and foodborne illness. U of Penn, Inst for Law & Econ Research Paper, (13–2). Retrieved from Google Scholar.

Prockop, L. D., & Chichkova, R. I. (2007). Carbon monoxide intoxication: An updated review. Journal of the Neurological Sciences, 262(1–2), 122–130. doi:10.1016/j.jns.2007.06.037

Smith, A. H., Lingas, E. O., & Rahman, M. (2000). Contamination of drinking-water by arsenic in bangladesh: A public health emergency. Bulletin of the World Health Organization, 78(9), 1093–1103. Retrieved from Google Scholar.

Sobngwi, E. (2004). Exposure over the life course to an urban environment and its relation with obesity, diabetes, and hypertension in rural and urban cameroon. International Journal of Epidemiology, 33(4), 769–776. doi:10.1093/ije/dyh044

Wild, C. P. (2005). Complementing the genome with an “exposome”: The outstanding challenge of environmental exposure measurement in molecular epidemiology. Cancer Epidemiology, Biomarkers & Prevention : A Publication of the American Association for Cancer Research, Cosponsored by the American Society of Preventive Oncology, 14(8), 1847–1850. doi:10.1158/1055–9965.epi–05–0456

Churn Modelling – thoughts from the DRM Hangout

Churn Modelling – Notes from the Digital CRM Video Hangout

Raja Mitra recently organised a very informative digital CRM web meeting, held over video, where Dr Abhijit Sanyal delivered a lecture on predictive analytics and its use with big data. He described three case studies. The first, for banks, showed how customer satisfaction and employee satisfaction were linked. The second was churn propensity modelling applied in the telecom sector (how to reduce the cost of targeting using churn propensity modelling).

Thoughts on Churn Modelling

Thinking about the churn modelling from Abhijit’s presentation, I was wondering if we can set up churn models for student churn through the university systems and patient churns through health systems, and then bring together this model through the design thinking aspects of doing a student or patient journey modelling. The churn modelling aspect was particularly informative.

“Churns”, or understanding why “clients”, after an initial engagement with the unit, do not come back or “fail to survive through the process”, is particularly interesting in the case of college admission units. For example, each year, the School of Health Sciences receives queries from hundreds of students; it’d be insightful to test what proportion of those student queries actually end up in admission. Also, once admitted, a certain percentage of the students move away from the university. What is the attrition rate of the different programmes? Here too, churn modelling will have a role to play.
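
Before any modelling, the two quantities mentioned above can be computed directly from counts. The following Python sketch is only an illustration: the function names and the numbers are hypothetical and are not data from the School of Health Sciences.

```python
def conversion_rate(enquiries, admitted):
    """Share of student enquiries that end in admission."""
    if enquiries <= 0:
        raise ValueError("enquiries must be positive")
    return admitted / enquiries


def attrition_rate(admitted, still_enrolled):
    """Share of admitted students who have left the programme."""
    if admitted <= 0:
        raise ValueError("admitted must be positive")
    return (admitted - still_enrolled) / admitted


# Hypothetical counts for one programme in one year.
print(conversion_rate(enquiries=400, admitted=100))
print(attrition_rate(admitted=100, still_enrolled=80))
```

Computing these rates separately for each programme gives the baseline figures that a churn propensity model would then try to explain and predict.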

The worth of churn modelling in the educational sector, for example in a university admissions department, will be similar to its worth for Internet companies that want to retain their customers, as holding on to an established customer base and trying to understand its behaviour is much more beneficial than acquiring new customers.

A similar issue works with the health services community. We need to know what is the “churn” of patients or clients who initially engage with the preventive or health promotive activities and then move away.

Initial Ideas for Churn Modelling in the Education and Health Sectors: Methods

//To be filled in

Writing Manifesto

Notes on Using Authorea for Collaboration

First, a personal manifesto while waiting for the first rays of the day to come: I have decided to publish in Open Access publications as far as practicable (at least for those where I am the principal author), use preprints exclusively for my presentations and all my publications, distribute the URLs of my preprints to my colleagues, students, friends, and network in academic media, and use plain text and web based writing media as far as practical. For other tools, the picture is a little divided. I prefer to use the open source R for my research, and LaTeX for my academic writing, including slide production in Beamer, but these are not always possible. I also use, and actually encourage people to use, Stata for statistical data analysis, so the landscape is mixed. However, for publishing my work, I am increasingly leaning towards collaborative publishing tools such as Overleaf and Authorea, and towards the PeerJ preprint server, so I believe it’s a good time to write a few notes about these tools here. As the bibliography format is largely BibTeX, it does not matter what reference software I use; I use Sente, but others can as easily be used. I have also started learning Python and use IPython notebooks, as well as JuliaBox, so these will come on as well.

Workflow I follow

Apply for funding –> fail/pass, do research (some primary data analysis, some secondary data analysis and meta analysis) –> publish, teaching –> publish, and collaborate. Develop first drafts in Authorea, and then distribute them as preprints to solicit responses from colleagues and critics before sending them out for publishing.

Not all colleagues will understand the value of plain text; some will prefer Word. What then? Send them RTF converted or HTML converted files, take in their Word mark-ups, incorporate their changes, and send the result back to them with the changes highlighted (along with notes). More work, but worth it, and it shows that I care about the changes that they want me to work on. Also, keep everything on Authorea and invite them. What if Authorea dies? Or what if Overleaf dies? Keep the same process, but keep working in a LaTeX document. If LaTeX is hard to learn, learn. It cannot be that bad.

Notes on Authorea

Authorea is intuitive. Basically start with a free account and start writing. There are a few minor tweaks that can make your life a lot easier to work with this medium though and save frustrations.

  1. Writing is simple, just plain text with markdown, and markdown is gorgeous. It has three levels of headings, bold and italic support, and standard table support (ideally markdown tables are good, and these can usually be prepared easily in the app itself). I prefer to write using markdown as this format is simple, easy, and intuitive. But it is possible to write using LaTeX as well.
  2. Add figures next to the text chunks.
  3. Speaking of chunks, it makes sense then to analyse everything in either R or Stata and then output in LaTeX and write back in Authorea.
  4. Create separate foo.tex files and add them to Authorea and then add the links to Authorea’s file
  5. Prepare a bibilo.bib file and keep the bibtex in it. Refer to the citations from there.
  6. Write in Scrivener and export from markdown to LaTeX and upload to Authorea for fine tuning
  7. Write in Authorea and keep copies in GitHub. That way, write everything on a personal computer and save it in a file in GitHub
  8. Keep a copy in personal blog (this), otherwise push every large piece of thought to Medium

Medium can be used for writing half finished blogs and for seeking peer review comments which can then be changed.

Draft: Notes on Writing and Publishing Workflow

Notes on Writing and Publishing Workflow

The world of scholarly publishing is changing. It’s almost like a new paradigm of publishing is opening before us. Open access publishing has made it easy for almost anyone with ideas to make them available to as wide an audience as one wants, because the article or the idea itself is free for anyone to access.

Compared with toll access and its trappings, this suggests a new way of doing things. The question still persists as to what are some of the ways where people can publish beta ideas, make it open and invite potential scholars and friends to come and comment on it and take it from a stage of beta or alpha ideas to work till a publishable structure is established. It turns out that there are different ways of doing this. I found a few tools very interesting and intuitive to work with. There are social aspects to this publishing and these are also open for business in the sense it is possible to submit it to other people’s review and get more ideas and refine them. There are risks that people might steal your ideas, mix them and share them for their own fulfilment without paying you any attribution. Unfortunately that is a genuine risk. That said, here are some tools that can be used to build scholarly publishing workflow.

Collaborative Authoring Software

A couple of collaborative authoring tools come to mind. Both are great. The first one is Authorea. Authorea is a free, open website where anyone can register for a free account and then get started publishing one’s own article.

There are two flavours to write. One of them is to write using Markdown, a typing process that is very intuitive and easy and uses plain text. I use plain text in almost all my writings and it synchronizes well with the email style of writing. You use simple texts to format your writing. For more information on markdown see John Gruber’s page on markdown formatted writing.

The other writing style that Authorea supports is LaTeX, a typesetting system that enables anyone to write again in plain text and typeset papers. I have found that the way to work best with Authorea is to invite your fellow authors or coauthors, and get them started with an account. Basically, you can start with an abstract or summary page, and then start with the process of writing. The tables in Authorea are based on LaTeX tables, which can sometimes be a little tedious to work with, unless one is familiar with typesetting LaTeX tables. The figures are easy to work with. Here is my working flow with Authorea:

1. I usually start by setting up a file structure in the file that is contained in the folder space.

2. In that file, I put in the sequence of the files (either LaTeX files or markdown files that will be added to the main body of the text)

3. We keep writing and adding figures and tables as needed. (This is the bit I need to work on)

4. Keep the citations in the citation tab or search for citations within Authorea itself.

The next collaborative authoring tool that comes to mind for scientific scholarly writing is Overleaf. Overleaf is great and is based on LaTeX alone. The beauty of Overleaf is the number of templates and publishing opportunities and a nice formatting system that takes the tedium out of publishing with LaTeX. In my opinion, this would be a killer app if we had some way of integrating Markdown in it for publishing. It’d be similarly nice if Authorea were to pay attention to templating or direct publishing to open access or toll access journals. Google Docs for that matter is also excellent in allowing real time collaboration where two individuals can simultaneously work on a document. Google Docs also tightly integrates with Google Scholar, a great source of literature and bibliographic information, and indeed with the search facility in general. However, I am not sure of the extent to which Google Docs can be easily used for typesetting ideas and expressions and for tightly integrating with an epidemiology and statistical workflow.

Publishing Workflow

Two shining examples are PeerJ and my new found The Winnower. PeerJ in particular shines with its preprint system where an article or indeed any piece of work can be hosted and others can be invited to come and view and comment on it, and then the Winnower system goes a step ahead in letting people directly publish, review, and comment and then revise the manuscript based on these comments. It’s a world where your publishing and scholarly communication workflow can be integrated with crowd ideas and comments and can be further embellished.