The Sustainable Development Goals Agenda


How are the Sustainable Development Goals different from the MDGs?
The 17 Sustainable Development Goals (SDGs) with 169 targets are broader in scope and go further than the MDGs by addressing the root causes of poverty and the universal need for development that works for all people. The goals cover the three dimensions of sustainable development: economic growth, social inclusion, and environmental protection. Building on the success and momentum of the MDGs, the new global goals cover more ground, with ambitions to address inequalities, economic growth, decent jobs, cities and human settlements, industrialization, oceans, ecosystems, energy, climate change, sustainable consumption and production, and peace and justice. The new Goals are universal and apply to all countries, whereas the MDGs were intended for action in developing countries only. A core feature of the SDGs is their strong focus on means of implementation: the mobilization of financial resources, capacity-building and technology, as well as data and institutions. The new Goals recognize that tackling climate change is essential for sustainable development and poverty eradication. SDG 13 aims to promote urgent action to combat climate change and its impacts.

 


Jupyter Notebook tips, tricks, and shortcuts


Jupyter Notebook

Jupyter Notebook, formerly known as the IPython Notebook, is a flexible tool that helps you create readable analyses, as you can keep code, images, comments, formulas, and plots together. Jupyter is quite extensible, supports many programming languages, and is easily hosted on your computer or on almost any server; you only need SSH or HTTP access. Best of all, it’s completely free.

Project Jupyter was born out of the IPython project as the project evolved to become a notebook that could support multiple languages – hence its historical name, the IPython notebook. The name Jupyter is an indirect acronym of the three core languages it was designed for – Julia, PYThon, and R – and is inspired by the planet Jupiter.

When working with Python in Jupyter, the IPython kernel is used, which gives us some handy access to IPython features from within our Jupyter notebooks (more on that later!).

 

Keyboard Shortcuts

As any power user knows, keyboard shortcuts will save you lots of time. Jupyter stores a list of keyboard shortcuts under the menu at the top: Help > Keyboard Shortcuts, or by pressing H in command mode (more on that later). It’s worth checking this each time you update Jupyter, as more shortcuts are added all the time.

Another way to access keyboard shortcuts and a handy way to learn them is to use the command palette: Cmd + Shift + P (or Ctrl + Shift + P on Linux and Windows). This dialog box helps you run any command by name – useful if you don’t know the keyboard shortcut for an action or if what you want to do does not have a keyboard shortcut. The functionality is similar to Spotlight search on a Mac, and once you start using it you’ll wonder how you lived without it!

 

Some of my favorites are presented below:

  • Esc will take you into command mode, where you can navigate around your Jupyter notebook with the arrow keys, regardless of the computer you are on (Linux or Windows).
  • While in command mode:
    • A to insert a new cell above the current cell, B to insert a new cell below.
    • M to change the current cell to Markdown, Y to change it back to code
    • D + D (press the key twice) to delete the current cell
  • Enter will take you from command mode back into edit mode for the given cell.
  • Shift + Tab will show you the docstring (documentation) for the object you have just typed in a code cell – you can keep pressing this shortcut to cycle through a few modes of documentation (a code-based way to pull up the same docs is sketched after this list).
  • Ctrl + Shift + - will split the current cell into two from where your cursor is.
  • Esc + F opens find-and-replace on your code, but not the outputs.
  • Esc + O toggles the cell output.
  • To select multiple cells at once, use the following handy methods:
    • Shift + J or Shift + Down selects the next cell in a downwards direction. You can also select cells in an upwards direction by using Shift + K or Shift + Up. Once cells are selected, you can delete / copy / cut / paste / run them as a batch. This is helpful when you need to move parts of a notebook. You can also use Shift + M to merge multiple cells at once. This is super cool, right?
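As promised above, here is a minimal sketch of the code-based route to the documentation that Shift + Tab shows. The `?` and `??` operators are standard IPython features; `json.dumps` is just an arbitrary example object:

```python
# Run inside a Jupyter code cell (IPython syntax, not plain Python):
import json

json.dumps?    # shows the signature and docstring, like Shift + Tab
json.dumps??   # shows the full source code, when it is available
```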

      I will come back with other handy methods I use every day when I play with data. Until next time, enjoy reading and leave comments, please.

 

Odds Ratio Explanation


Generalized Linear Models (GLM)

What is an odds ratio?

An odds ratio (OR) is a measure of association between an exposure and an outcome. The OR represents the odds that an outcome will occur given a particular exposure, compared to the odds of the outcome occurring in the absence of that exposure. Odds ratios are most commonly used in case-control studies; however, they can also be used in cross-sectional and cohort study designs (with some modifications and/or assumptions).

Odds ratios and logistic regression

When a logistic regression is calculated, the regression coefficient (b1) is the estimated increase in the log odds of the outcome per unit increase in the value of the exposure. In other words, the exponential function of the regression coefficient (e^b1) is the odds ratio associated with a one-unit increase in the exposure.
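To make this concrete, here is a minimal sketch with synthetic data; the library choice (statsmodels) and every number below are illustrative assumptions, not taken from any study:

```python
# Fit a logistic regression and convert the coefficient to an odds ratio.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=500)                   # binary exposure (0/1)
true_log_odds = -1.0 + 0.8 * x                     # assumed true model
y = (rng.random(500) < 1 / (1 + np.exp(-true_log_odds))).astype(int)

X = sm.add_constant(x)                             # intercept + exposure
fit = sm.Logit(y, X).fit(disp=0)
b1 = fit.params[1]                                 # increase in log odds per unit of x
print("OR for a one-unit increase:", np.exp(b1))   # e^b1, close to e^0.8 ≈ 2.23
```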

When is it used?

Odds ratios are used to compare the relative odds of the occurrence of the outcome of interest (e.g. disease or disorder), given exposure to the variable of interest (e.g. a health characteristic or an aspect of medical history). The odds ratio can also be used to determine whether a particular exposure is a risk factor for a particular outcome, and to compare the magnitude of various risk factors for that outcome.

  • OR = 1: exposure does not affect the odds of the outcome
  • OR > 1: exposure is associated with higher odds of the outcome
  • OR < 1: exposure is associated with lower odds of the outcome

What about confidence intervals?

The 95% confidence interval (CI) is used to estimate the precision of the OR. A large CI indicates a low level of precision of the OR, whereas a small CI indicates a higher precision of the OR. It is important to note, however, that unlike the p-value, the 95% CI does not report a measure’s statistical significance. In practice, the 95% CI is often used as a proxy for the presence of statistical significance if it does not overlap the null value (e.g. OR=1). Nevertheless, it would be inappropriate to interpret an OR with 95% CI that spans the null value as indicating evidence for lack of association between the exposure and outcome.

Confounding

When a non-causal association observed between a given exposure and outcome is the result of the influence of a third variable, it is termed confounding, and the third variable is termed a confounding variable. A confounding variable is causally associated with the outcome of interest, and non-causally or causally associated with the exposure, but is not an intermediate variable in the causal pathway between exposure and outcome (Szklo & Nieto, 2007). Stratification and multiple regression techniques are two methods used to address confounding and produce “adjusted” ORs.
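One classic stratification approach is the Mantel-Haenszel pooled estimate, which combines the per-stratum two-by-two tables into a single adjusted OR. The sketch below uses made-up counts for two strata of a hypothetical confounder; a, b, c, d follow the definitions given in the example section below:

```python
# Mantel-Haenszel pooled (adjusted) odds ratio across confounder strata.
# Each tuple is (a, b, c, d): exposed cases, exposed non-cases,
# unexposed cases, unexposed non-cases. Counts are invented for illustration.
strata = [
    (10, 20, 5, 40),
    (30, 15, 20, 25),
]

numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
print("adjusted OR:", numerator / denominator)
```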

Example

Data from an article published in the Journal in November 2008 will be used to illustrate how ORs (A) and 95% CIs (B) are calculated. In their article, Greenfield and colleagues looked at previously suicidal adolescents (n=263) and used logistic regression to analyze the associations between baseline variables, such as age, sex, presence of psychiatric disorder, previous hospitalizations, and drug and alcohol use, and suicidal behaviour at six-month follow-up (Greenfield et al., 2008).

A) Calculating Odds Ratios

We will calculate odds ratios (OR) using a two-by-two frequency table

                 Cases    Non-cases
  Exposed          a          b
  Unexposed        c          d

Where

  • a = Number of exposed cases
  • b = Number of exposed non-cases
  • c = Number of unexposed cases
  • d = Number of unexposed non-cases
OR = (a/c) / (b/d) = ad / bc

OR = [(n) exposed cases / (n) unexposed cases] / [(n) exposed non-cases / (n) unexposed non-cases]
   = [(n) exposed cases × (n) unexposed non-cases] / [(n) exposed non-cases × (n) unexposed cases]

In the study, 186 of the 263 adolescents previously judged as having experienced a suicidal behavior requiring immediate psychiatric consultation did not exhibit suicidal behavior (non-suicidal, NS) at six-month follow-up. Of this group, 86 young people had been assessed as having depression at baseline. Of the 77 young people with persistent suicidal behavior at follow-up (suicidal behavior, SB), 45 had been assessed as having depression at baseline.

What is the OR of suicidal behavior at six-month follow-up, given the presence of depression at baseline?

First we determine the numbers to use for (a), (b), (c), (d)

  • a: Number of exposed cases (+ +) = ?
  • b: Number of exposed non-cases (+ –) = ?
  • c: Number of unexposed cases (– +) = ?
  • d: Number of unexposed non-cases (– –) = ?

Q1: Who are the exposed cases (++ = a)?

A1: Youth with persistent SB assessed as having depression at baseline

a=45

Q2: Who are the exposed non-cases (+ – = b)?

A2: Youth with no SB at follow-up assessed as having depression at baseline

b=86

Q3: Who are the unexposed cases (– + = c)?

A3: Youth with persistent SB not assessed as having depression at baseline

c: 77 (SB) − 45 (depression) = 32

Q4: Who are the unexposed non-cases (– – = d)?

A4: Youth with no SB at follow-up not assessed as having depression at baseline

d: 186 (NS) − 86 (depression) = 100

Then we plug the values into the formula

  • a: Number of exposed cases (++) = 45
  • b: Number of exposed non-cases (+ –) = 86
  • c: Number of unexposed cases (– +) = 32
  • d: Number of unexposed non-cases (– –) = 100
OR = (a/c) / (b/d) = ad / bc = (45/32) / (86/100) = (45 × 100) / (32 × 86) = 1.63

Thus, the odds of persistent suicidal behavior are 1.63 times higher given a baseline depression diagnosis, compared to no baseline depression diagnosis.
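As a quick check of the arithmetic, the same calculation in plain Python (the numbers are exactly those from the table above):

```python
# Odds ratio from the two-by-two table in the worked example.
a, b, c, d = 45, 86, 32, 100

odds_ratio = (a * d) / (b * c)  # ad / bc
print(odds_ratio)               # ≈ 1.635, reported as 1.63 in the article
```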

B) Calculating 95% confidence intervals

What are the confidence intervals for the OR calculated above?

Confidence intervals are calculated using the formula shown below

Upper 95% CI = e^[ln(OR) + 1.96 × √(1/a + 1/b + 1/c + 1/d)]
Lower 95% CI = e^[ln(OR) − 1.96 × √(1/a + 1/b + 1/c + 1/d)]

Plugging in the numbers from the table above, we get:

Upper 95% CI = e^[ln(1.63) + 1.96 × √(1/45 + 1/86 + 1/32 + 1/100)] = 2.80
Lower 95% CI = e^[ln(1.63) − 1.96 × √(1/45 + 1/86 + 1/32 + 1/100)] = 0.96
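The same computation in plain Python, using the standard error of ln(OR) from the formula above:

```python
# 95% CI for the OR via the standard error of ln(OR).
import math

a, b, c, d = 45, 86, 32, 100
or_hat = (a * d) / (b * c)
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)

lower = math.exp(math.log(or_hat) - 1.96 * se_log_or)  # ≈ 0.96
upper = math.exp(math.log(or_hat) + 1.96 * se_log_or)  # ≈ 2.80
print(f"95% CI: {lower:.2f} to {upper:.2f}")
```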

Since the 95% CI of 0.96 to 2.80 spans 1.0, the increased odds (OR 1.63) of persistent suicidal behavior among adolescents with depression at baseline does not reach statistical significance. In fact, this is indicated in Table 1 of the reference article, which shows a p-value of 0.07.

Interestingly, the odds of persistent suicidal behavior in this group given the presence of borderline personality disorder at baseline were more than twice those for depression (OR 3.8, 95% CI: 1.6–8.7), and the association was statistically significant (p = 0.002).

This example illustrates a few important points. First, the presence of a positive OR for an outcome given a particular exposure does not necessarily indicate that this association is statistically significant. One must consider the confidence intervals and p-value (where provided) to determine significance. Second, while the psychiatric literature shows that overall, depression is strongly linked to suicide and suicide attempt (Kutcher & Szumilas, 2009), in a particular sample, with a particular size and composition, and in the presence of other variables, the association may not be significant.

Understanding odds ratios, how they are calculated, what they mean, and how to compare them is an important part of understanding scientific research.

Conceptual Statistical Overview


  • Conceptual overview of the (arithmetic) mean (i.e., average), median, mode, and standard deviation (std):
    mean: the average of all numbers
    median: the middle number in the sorted list of all numbers
    mode: the most frequent number
    std: a measure of how spread out the numbers are
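These all live in Python’s standard library; the data below are made-up values just to show the four functions:

```python
# Mean, median, mode, and (sample) standard deviation with the stdlib.
import statistics as stats

data = [2, 3, 3, 5, 7, 10]
print(stats.mean(data))    # 5.0   – average of all numbers
print(stats.median(data))  # 4.0   – middle of the sorted list
print(stats.mode(data))    # 3     – most frequent value
print(stats.stdev(data))   # ≈3.03 – sample standard deviation
```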

Forget p-values? Or just let them be what they are

The p-value has always been controversial. It is required for certain publications, banned from some journals, hated by many, yet quoted widely. Not all p-values are loved equally. Because of a “rule” popularized some 90 years ago, small values below 0.05 have been the crowd’s favorite.

When teaching hypothesis testing, we explain that the entire spectrum of p-value serves a single purpose: quantifying the “agreement” between an observed set of data and a statement (or claim) known as the null hypothesis.

Why are we “obsessed” with the small values then? Why can’t we talk about any specific p-value the same way we talk about today’s temperature? i.e., as a measure of something.

First of all, the scale of the p-value is hard to talk about. This is different from temperature: the difference between 0.21 and 0.20 is not the same as the difference between 0.02 and 0.01.

It almost feels like we should use the reciprocal of a p-value to discuss the likeliness of the corresponding observed “data” (represented by a summary/test statistic), assuming the null hypothesis is true.

If the null hypothesis is true, it takes, on average, 100 independent tests to observe a p-value below 0.01. The occurrence of a p-value under 0.02 is twice as likely, taking only about 50 tests to observe. Therefore 0.01 is twice as unlikely as 0.02. Using a similar calculation, 0.21 and 0.20 are almost identical in terms of likeliness under the null.
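This reciprocal intuition follows from the fact that, when the null hypothesis is true, the p-value is uniformly distributed on [0, 1]. A tiny simulation (purely synthetic) makes the point:

```python
# Under a true null hypothesis, p-values are uniform on [0, 1].
import numpy as np

rng = np.random.default_rng(42)
p_values = rng.random(1_000_000)   # simulated p-values under the null

print((p_values < 0.01).mean())    # ≈ 0.01 → about 1 test in 100
print((p_values < 0.02).mean())    # ≈ 0.02 → twice as likely, 1 in 50
```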

In Introductory Statistics, we teach that a test of significance has four steps:

  1. stating the hypotheses and a desired level of significance;
  2. computing the test statistic;
  3. finding the p-value;
  4. concluding given the p-value. 

It is step 4 that requires us to draw a line somewhere on the spectrum of p-value between 0 and 1. That line is known as the level of significance.

I never enjoyed explaining how one would choose a level of significance. Many of my students felt confused. Technically speaking, if a student derived a p-value of 0.28, she could claim it is significant at a significance level of 0.30. The reason this is silly is that a chosen significance level should convey a certain sense of rare occurrence: so rare that it is deemed contradictory to the null hypothesis. No one with common sense would argue that a chance close to 1 in 3 represents rarity.

What common sense fails to deliver is how rare is contradictory enough. A recent HBR article showed that people vary widely in how they perceive a concept of likelihood such as “rare”.

The solution?

“Use probabilities instead of words to avoid misinterpretation.” 

P-value and significance level serve precisely this purpose.

Does 1/20 need to be a universal choice? It doesn’t. Statisticians are not much bothered by “insignificant results”, as we think 0.051 is just as interesting as 0.049. Whenever possible, we report the actual p-value instead of stating that we reject/accept the null hypothesis at a certain level. We use the p-value to quantify the strength of evidence between variables and across studies.

However, sometimes we don’t have a choice, so we get creative.

For any particular test between a null hypothesis and an alternative, a representative (i.e., free of selection bias) sample of p-values would offer a much better picture than the current published record: a handful of p-values under 0.05 out of who-knows-how-many trials. There have been suggestions to publish insignificant results to avoid the so-called “cherry-picking” based on p-values. Despite the apparent appeal of such a reform, I cannot imagine it is practically possible. First of all, if we assume that most people have been following the 0.05 “rule”, publishing all the insignificant results would mean a roughly 20-fold increase in the number of published studies. Yet it probably would create a very interesting dataset for data mining. What would be useful is a public database of p-values from repeated studies of precisely the same test (not just the same null hypothesis, as the test depends on the alternative as well). In such a database, a p-value can finally be just what it is: a measure of agreement between a data set and a claim.