Introduction To Statistical Testing

Author

Deebul Nair

Published

June 15, 2023

Introduction To Statistical Testing

Introduction To Statistical Tests

Below are the slides I created for introducing statistical testing to master's students of the Autonomous Systems group at Hochschule Bonn-Rhein-Sieg.

import pandas as pd
import numpy as np
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt

Learning Goals

  • Research Question vs Research Hypothesis vs Statistical Hypothesis
  • Why Statistical Test
  • Intro Null Hypothesis Statistical Testing
  • Descriptive Statistics
  • Bunch of Tests

Null Hypothesis Statistical Testing (NHST)?

Power Analysis

References: [1] Guy Hoffman and Xuan Zhao. 2020. A Primer for Conducting Experiments in Human–Robot Interaction. J. Hum.-Robot Interact. 10, 1, Article 6 (March 2021), 31 pages. https://doi.org/10.1145/3412374
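
As a quick illustration (not from the slides), an a priori power analysis can be sketched with statsmodels. The effect size (Cohen's d = 0.5), significance level, and target power below are illustrative assumptions, not values taken from the reference.

# A minimal power-analysis sketch: how many participants per group are
# needed to detect a medium effect (d = 0.5) with 80% power at alpha = 0.05?
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"Participants needed per group: {n_per_group:.1f}")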

This Single Statement Generates Great Robotic Research [1]

  • [1] https://blog.robotiq.com/this-single-statement-generates-great-robotics-research

A hypothesis, of course. Writing a hypothesis is a key skill for experimental design.

What is a Hypothesis Anyway?

Simply put, a hypothesis is a statement which makes a prediction about something which is not proven. It is a kind of educated guess.

You need a hypothesis when you want to prove something through experimentation.

  • Robotics is a little different from the natural sciences, because it is primarily an engineering field.
  • The Engineering Process is different from the Scientific Method, because it doesn’t usually make hypotheses and conduct experiments to prove them.
  • Instead, it defines design criteria and then develops technologies which achieve those criteria.

Most of the research work in robotics is spent on developing new technologies and methodologies. However, even though the primary purpose of your research is not to prove a hypothesis, as it is in natural science research, you still need a hypothesis to conduct experiments which prove that your developments work as well as you claim they do.

Therefore, a typical robotics hypothesis might look something like:

"Our algorithm calculates the solution faster than a naive algorithm can calculate the same problem."

Examples of Good Hypothesis from Robotics

1. Multi-Robot Grasp Planning for Sequential Assembly Operations (Dogar et al 2015)

In this paper, the authors use multiple mobile manipulators to construct a flat-packed chair with or without re-grasps (i.e. putting down the piece and picking it up in a different way).

The hypothesis can be more or less expressed like this:

Our algorithm calculates a good enough manipulation solution with few re-grasps faster than a naive algorithm can calculate the same problem optimally (i.e. with no re-grasps).

Bad Hypothesis

For the sake of demonstration, here is an example of what a less specific hypothesis might look like, one which does not define variables. It is completely unspecific as to what “quickly” means:

Our algorithm can calculate a solution quickly.

Another problem might arise if the hypothesis does not include testability. For example, this hypothesis would not be testable, because you could never test all of the other algorithms:

Our algorithm calculates a solution better than all of the other planning algorithms.

Example 2: Grasping without Squeezing: Shear Adhesion Gripper with Fibrillar Thin Film (Hawkes et al 2015)

It presents a new type of gripper which, as the authors explain, does not use normal forces to pick up objects as most robotic grippers do. Instead, the gripper uses shear forces which are applied to the manipulated object via a flexible film.

If a gripper uses shear adhesion forces then it will be a viable option for robotic grasping of objects with a large radius of curvature.

Task 1

  • Write one hypothesis for your R&D/Thesis.

  • Discussion

Research Hypothesis vs Statistical Hypothesis?

  • Statistical hypotheses must be mathematically precise and they must correspond to specific claims about the characteristics of the data generating mechanism (i.e., the “population”).

  • Even so, the intent is that statistical hypotheses bear a clear relationship to the substantive research hypotheses that you care about.

  • Our hypothesis can be more or less expressed like this: “Our algorithm calculates a good enough manipulation solution with few re-grasps faster than a naive algorithm can calculate the same problem optimally (i.e. with no re-grasps).”

  • So let’s think about what our statistical hypothesis statement would be.

  • The quantities that we are interested in within the experiment are \(P(regrasps)_{old}\) and \(P(regrasps)_{new}\), the true-but-unknown probabilities with which the robot re-grasps with the old and the new planner.

\[P(regrasps)_{old} > P(regrasps)_{new}\]
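
Once re-grasp counts are collected for both planners, this directional hypothesis could, for example, be tested with Fisher's exact test on a 2×2 table. The counts below are purely hypothetical and only illustrate the mechanics.

# Hypothetical counts (illustration only): trials with at least one re-grasp
# vs. trials without, for the naive (old) and the proposed (new) planner.
table = [[14,  6],   # old planner: 14 of 20 trials needed a re-grasp
         [ 5, 15]]   # new planner:  5 of 20 trials needed a re-grasp

# One-sided test of H1: P(regrasps)_old > P(regrasps)_new
odds_ratio, p_value = stats.fisher_exact(table, alternative='greater')
print(p_value)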

Task 2

Convert your hypothesis from above into a statistical hypothesis.

Null Hypothesis Statistical Testing (NHST)?

Research Questions and Data Examples

Experiment 1 : Weight Gain

References:

  • Moore, D. S., McCabe, G. P., and Craig, B. A. (2012). Introduction to the Practice of Statistics (7th ed.). New York: Freeman.
  • Levine, J. A., Eberhardt, N. L., and Jensen, M. D. (1999). Role of nonexercise activity thermogenesis in resistance to fat gain in humans. Science, 283:212-214.

Description:

“Weight Gain”, provides weights of 16 participants before and after an eight-week period of excessive calorie intake (Moore et al., 2012, p. 425).

Hypothesis:

We will test the hypothesis that 1000 excess calorie intake per day over 8 weeks results in 16 pounds (approximately 7.2 kilograms) weight increase.

Variables:

Weight Before - Weight in pounds (lb) measured before eight weeks of excessive calorie intake.
Weight After - Weight in pounds (lb) measured after eight weeks of excessive calorie intake.
Difference - Weight After - Weight Before.
weight_gain_data = pd.read_csv('https://raw.githubusercontent.com/jasp-stats/jasp-desktop/4527546659fdbd0849261f22ec84e9db76adc49d/Resources/Data%20Sets/Data%20Library/2.%20T-Tests/Weight%20Gain.csv')
weight_gain_data.head()
Weight Before Weight After Difference
0 122.54 135.74 13.20
1 120.78 129.36 8.58
2 131.12 145.20 14.08
3 137.06 145.64 8.58
4 163.24 173.80 10.56

Experiment 2 : Response to Eye Color

Description:

“Response to Eye Color”, provides post-advertisement attitudes towards a brand expressed by four different groups - each group saw the same advertisement except for the aspect that was manipulated: the eye-color of the model.

Hypothesis:

The null hypothesis is that the attitudes are the same regardless of the eye color of the model.

Variables:

Group - Experimental conditions ('Blue' = Model with blue eyes, 'Brown' = Model with brown eyes, 'Green' = Model with green eyes, 'Down' = Model's eye color cannot be seen).
Subj - Participant number.
Score - An average of 10 survey questions about attitudes towards the brand (7-point Likert scale). Higher averages correspond to more positive attitudes.

Data

eye_color_data = pd.read_csv('https://raw.githubusercontent.com/jasp-stats/jasp-desktop/4527546659fdbd0849261f22ec84e9db76adc49d/Resources/Data%20Sets/Data%20Library/3.%20ANOVA/Response%20to%20Eye%20Color.csv')
eye_color_data.head()
Group Subj Score
0 Blue 1 1.3
1 Blue 2 1.0
2 Blue 3 7.0
3 Blue 4 4.2
4 Blue 5 5.4

References:

Moore, D. S., McCabe, G. P., and Craig, B. A. (2012). Introduction to the Practice of Statistics (7th ed.). New York: Freeman.

Simpson, P. M., Sturges, D. L., and Tanguma, J. (2008). The eyes have it, or do they? The effects of model eye color and eye gaze on consumer ad response. The Journal of Applied Business and Economics, 8: 60-72.

Experiment 3 : Brain Size and Intelligence

Reference

Willerman L, Schultz R, Rutledge JN, Bigler ED. In vivo brain size and intelligence. Intelligence. 1991 Apr 1;15(2):223-8.

Description

In this study by Willerman et al. (1991) the researchers use Magnetic Resonance Imaging (MRI) to determine the brain size of the subjects. The researchers take into account gender and body size to draw conclusions about the connection between brain size and intelligence.

Hypothesis

Are the size and weight of your brain indicators of your mental capacity?

Variable Names:

  1. Gender: Male or Female
  2. FSIQ: Full Scale IQ scores based on the four Wechsler (1981) subtests
  3. VIQ: Verbal IQ scores based on the four Wechsler (1981) subtests
  4. PIQ: Performance IQ scores based on the four Wechsler (1981) subtests
  5. Weight: body weight in pounds
  6. Height: height in inches
  7. MRI_Count: total pixel Count from the 18 MRI scans
brain_data = pd.read_csv('https://scipy-lectures.org/_downloads/brain_size.csv', sep=';', na_values=".")
brain_data.head()
Unnamed: 0 Gender FSIQ VIQ PIQ Weight Height MRI_Count
0 1 Female 133 132 124 118.0 64.5 816932
1 2 Male 140 150 124 NaN 72.5 1001121
2 3 Male 139 123 150 143.0 73.3 1038437
3 4 Male 133 129 128 172.0 68.8 965353
4 5 Female 137 132 134 147.0 65.0 951545

Statistical Testing

1. Descriptive Statistics

  • counts
  • mean
  • median
  • max
  • min
  • ….

brain_data.describe()
Unnamed: 0 FSIQ VIQ PIQ Weight Height MRI_Count
count 40.000000 40.000000 40.000000 40.00000 38.000000 39.000000 4.000000e+01
mean 20.500000 113.450000 112.350000 111.02500 151.052632 68.525641 9.087550e+05
std 11.690452 24.082071 23.616107 22.47105 23.478509 3.994649 7.228205e+04
min 1.000000 77.000000 71.000000 72.00000 106.000000 62.000000 7.906190e+05
25% 10.750000 89.750000 90.000000 88.25000 135.250000 66.000000 8.559185e+05
50% 20.500000 116.500000 113.000000 115.00000 146.500000 68.000000 9.053990e+05
75% 30.250000 135.500000 129.750000 128.00000 172.000000 70.500000 9.500780e+05
max 40.000000 144.000000 150.000000 150.00000 192.000000 77.000000 1.079549e+06

Mean vs Median

  • Learning Statistics with JASP - Page 61

Interquartile Range (IQR)

  • The interquartile range (IQR) is like the range, but instead of the difference between the biggest and smallest value, it is the difference between the 25th percentile and the 75th percentile.
  • Percentile: arrange the data in ascending order; the 25th percentile is the value below which 25% of the data fall (see the short sketch below).
  • In fact, we’ve already come across the idea: the median of a data set is its 50th percentile!
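
As a small sketch (using the weight-gain data loaded above), percentiles and the IQR can be computed directly with pandas:

# Percentiles and interquartile range of the weight differences
q25, q50, q75 = weight_gain_data['Difference'].quantile([0.25, 0.50, 0.75])
print('25th percentile:', q25, ' median:', q50, ' 75th percentile:', q75)
print('IQR:', q75 - q25)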

Standard Deviation

  • In general, you should expect 68% of the data to fall within 1 standard deviation of the mean,

  • 95% of the data to fall within 2 standard deviations of the mean, and

  • 99.7% of the data to fall within 3 standard deviations of the mean.

  • This rule tends to work pretty well most of the time, but it’s not exact (see the quick check below).

  • It’s actually calculated based on an assumption that the histogram is symmetric and “bell shaped”.

  • Learning Statistics with JASP - Page 64
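
A quick check of this rule on the FSIQ scores loaded above (a minimal sketch; the FSIQ sample may not be particularly bell shaped, so the percentages can deviate from 68/95/99.7):

# Fraction of FSIQ scores within 1, 2 and 3 standard deviations of the mean
fsiq = brain_data['FSIQ']
mean, std = fsiq.mean(), fsiq.std()
for k in (1, 2, 3):
    within = ((fsiq > mean - k * std) & (fsiq < mean + k * std)).mean()
    print(f"within {k} SD: {within:.1%}")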

Plotting

  • Try plotting the data to get a complete picture

Matejka, J., & Fitzmaurice, G.W. (2017). Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing. Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems.

Plots

  1. Scatter plot
  2. Box plot
  3. Density Plots
  4. Rain cloud plots

Scatter Plot

# Scatter plot of the weight differences against participant index
plt.scatter(range(weight_gain_data.Difference.size), weight_gain_data.Difference)

# One-dimensional scatter: all differences plotted along a single line
plt.scatter(weight_gain_data.Difference, np.zeros(weight_gain_data.Difference.size))

Box Plot

#Box plot
sns.boxplot(x=weight_gain_data["Difference"])

Box plot

  • https://chartio.com/learn/charts/box-plot-complete-guide/
sns.boxplot(data=weight_gain_data[["Weight Before", "Weight After"]])

sns.violinplot(data=weight_gain_data[["Weight Before", "Weight After"]])

Density plot

sns.displot(data=weight_gain_data, x="Weight Before")

sns.displot(data=weight_gain_data, x="Weight Before", kde=True)

sns.displot(data=weight_gain_data[["Weight Before", "Weight After"]], kind='kde')

Does this plot remind you of any other plot?

Some plots lie !

Matejka, J., & Fitzmaurice, G.W. (2017). Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing. Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems.

Rain cloud Plot

  • Combines all three plot types in one figure (a rough seaborn sketch follows below).
  • scatter + density + box plot
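
Dedicated raincloud-plot packages exist, but a rough approximation can be layered with the seaborn calls already used above; this is only a sketch of the idea:

# Rough raincloud-style view: density (violin) + box + raw points (strip)
fig, ax = plt.subplots()
sns.violinplot(x=weight_gain_data['Difference'], inner=None, color='lightblue', ax=ax)
sns.boxplot(x=weight_gain_data['Difference'], width=0.15, ax=ax)
sns.stripplot(x=weight_gain_data['Difference'], color='black', size=4, ax=ax)
plt.show()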

Why a statistical test? Why not just compare means?

  • For example our Weight Gain Dataset.
  • Mean(weight before ) = 144.63
  • Mean(weight after) = 155.045
  • Difference of Means = 10.408

Can you conclude that there is conclusive evidence that the weights increase with the diet ?

The fundamental question underlying statistical analysis:

estimating the amount of evidence in support of our hypothesis, even in the presence of noise.

weight_gain_data.describe()
Weight Before Weight After Difference
count 16.000000 16.000000 16.000000
mean 144.636250 155.045000 10.408750
std 22.704877 21.438057 3.840639
min 117.260000 129.360000 3.080000
25% 126.115000 137.885000 7.700000
50% 138.270000 147.840000 11.110000
75% 161.480000 170.335000 13.420000
max 201.740000 204.820000 15.620000

Statistical Testing for Evaluating Experiments

  • Deals with the relationship between the value of data, its variance, and the confidence of a conclusion

A typical situation:

  • Existing technique A
  • You developed a new technique B
  • Key question: Is B better than A?

Evaluating Experiments

1. Define a performance measure, e.g.

  • Run-time
  • Error
  • Accuracy
  • Robustness (success rate, MTBF, …)

2. Collect data d

  • Run both techniques on the data d
  • How to compare the obtained results \(A(d), B(d)\)?

1st Example

Scenario

  • A, B are two planning techniques
  • The score is the planning time
  • Data d is a given map, start and goal pose

Example

  • A(d) = 0.5 s
  • B(d) = 0.6 s

What does that mean?

2nd Example

  • Same scenario but four tasks

Example

  • A(d) = 0.5 s, 0.4 s, 0.6 s, 0.4 s
  • B(d) = 0.4 s, 0.3 s, 0.6 s, 0.5 s

Mean of the planning time is

  • \(μ_A\) = 1.9 s/4 = 0.475 s
  • \(μ_B\) = 1.8 s/4 = 0.45 s

Is B really better than A?

Is B better than A?

  • \(μ_A\) = 0.475 s, \(μ_B\) = 0.45 s
  • \(μ_A\) > \(μ_B\), so B is better than A?!
  • We just evaluated four tests, thus \(μ_A\) and \(μ_B\) are rough estimates only
  • We have seen too little data to make statements with high confidence
  • How can we make a confident statement that B is better than A? (A quick sketch of a paired t-test on these four runs follows below.)
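
To make this concrete, here is a minimal sketch of a paired t-test on the four run times from the example; with only four paired measurements the test has very little power, and the resulting p-value is far above the usual 0.05 threshold:

# Paired t-test on the four run times from the example above
A = [0.5, 0.4, 0.6, 0.4]   # run times of technique A (seconds)
B = [0.4, 0.3, 0.6, 0.5]   # run times of technique B (seconds)
print(stats.ttest_rel(A, B))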

Hypothesis Testing

“Answer a yes-no question about a population and assess the probability that the answer is wrong.” [Cohen ’95]

  • Example: To test that B is different from A, assume they are truly equal.

Then, assess the probability of the obtained result. If the probability is small, reject the hypothesis.

The Null Hypothesis \(H_0\)

  • The null hypothesis is the hypothesis that one wants to reject by analyzing data (from experiments)
  • \(H_0\) is the default state
  • A statistical test can never prove \(H_0\)
  • A statistical test can only reject or fail to reject \(H_0\)
  • Example: to show that method A is better than B, use \(H_0\): A=B

Typical/Popular Null Hypotheses

Example 1 : Comparing one sample to a known or hypothesized population mean.

\[ H_0 : \mu = 0 \]

\[ H_1 : \mu \neq 0 \] (two-tailed test)

\[ H_1 : \mu < 0 \] (one-tailed test)

\[ H_1 : \mu > 0 \] (one-tailed test)

Example 2 : Testing relationships between two or more variables

\[ H_0 : \mu_1 = \mu_2 \]

\[ H_1 : \mu_1 \neq \mu_2 \] (two-tailed test)

\[ H_1 : \mu_1 < \mu_2 \] (one-tailed test)

\[ H_1 : \mu_1 > \mu_2 \] (one-tailed test)

P Value

Learning Statistics with JASP, Page 169

Statistical Tests

Student’s t-tests

One sample location test

Research is normally carried out in sample populations, but how close does the sample reflect the whole population? The parametric one-sample t-test determines whether the sample mean is statistically different from a known or hypothesized population mean. The null hypothesis (Ho) tested is that the sample mean is equal to the population mean.

  • Given a µ and σ of a population
  • Test if a sample (from the population) has a significantly different mean than the population
  • Sample of size N

\[ H_0 : \mu = K \]

\[ H_1 : \mu \neq K \] (two-tailed test)

\[ H_1 : \mu < K \] (one-tailed test)

\[ H_1 : \mu > K \] (one-tailed test)

1-sample t-test: testing the value of a population mean

scipy.stats.ttest_1samp() tests whether the population mean of the data is likely to be equal to a given value (technically, whether the observations are drawn from a Gaussian distribution with the given population mean). It returns the T statistic and the p-value (see the function’s help):

from scipy import stats
stats.ttest_1samp(brain_data['VIQ'], 0)   
TtestResult(statistic=30.08809997084933, pvalue=1.3289196468727879e-28, df=39)

Reporting/Conclusion

With a p-value of about \(10^{-28}\) we can claim that the population mean for the IQ (VIQ measure) is not 0.

One Sample t-Test: Assumptions

  • Independently generated samples
  • The population distribution is Gaussian (otherwise the t-distribution is not the correct choice)
  • The population mean to test against is known or hypothesized

Comments

  • The t-Test is quite robust under non-Gaussian distributions
  • Often a 95% or 99% confidence (=5% or 1% significance) level is used
  • t-Test is one of the most frequently used tests in science

BINOMIAL TEST

  • The binomial test is effectively a non-parametric version of the one-sample t-test for use with dichotomous (i.e. yes/no) categorical datasets.
  • This tests whether or not the sample frequency is statistically different from a known or hypothesized population frequency.
  • The null hypothesis (Ho) tested is that the sample data frequency is equal to the expected population frequency.

ASSUMPTIONS

Three assumptions are required for a binomial test to provide a valid result:

  • The test variable should be a dichotomous scale (such as yes/no, male/female etc.).
  • The sample responses should be independent.
  • The sample is smaller than, but representative of, the population.

Example : Laptop

  • Open binomial.csv, this contains one column of data showing the number of students using either a Windows laptop or a MacBook at University.

  • In January 2018, when comparing just the two operating systems, the UK market share of Windows was 86% and macOS 14%.

  • Null hypothesis \(H_0 : p = 0.86\)

  • Alternative hypothesis \(H_1 : p \neq 0.86\)

laptop_data = pd.read_csv('Binomial.csv')
laptop_data
laptop_data.describe()
Laptop
count 90
unique 2
top Windows
freq 54
laptop_data.value_counts()
Laptop 
Windows    54
Mac        36
dtype: int64
result = stats.binomtest(54, n=90, p=0.86, alternative='two-sided')
print ('pvalue : ', result.pvalue)
print ('95% confidence level' ,result.proportion_ci(confidence_level=0.95))
print ('proportion estimate ', result.proportion_estimate)
pvalue :  1.1957465473118124e-09
95% confidence level ConfidenceInterval(low=0.4913413183034394, high=0.7018860844296883)
proportion estimate  0.6

REPORTING THE RESULTS

The UK proportion of Windows and MacBook users was reported to be 86% and 14% respectively. In a cohort of University students (N=90), a binomial test revealed that the proportion of students using Windows laptops was significantly less (60%, p<.001) and the proportion using MacBooks significantly more (40%, p<.001) than expected.

Two Sample t-Test

  • Often, one wants to compare the means of two samples to see if both are drawn from populations with equal means
  • Example: Compare two estimation procedures (operating on potentially different data sets)
  • Typical hypotheses:

\[ H_0 : \mu_1 = \mu_2 \]

\[ H_1 : \mu_1 \neq \mu_2 \] (two-tailed test)

\[ H_1 : \mu_1 < \mu_2 \] (one-tailed test)

\[ H_1 : \mu_1 > \mu_2 \] (one-tailed test)

  • This test produces a t-score, which is a ratio of the difference between the two groups to the difference within the two groups:
  • A large t-score indicates that there is a greater difference between groups. The smaller the t-score, the more similarity there is between groups.
  • A t-score of 5 means that the groups are five times as different from each other as they are within each other.
  • The null hypothesis (Ho) tested is that the population means from the two unrelated groups are equal

2-sample t-test: testing for difference across populations

We have seen above that the mean VIQ in the male and female populations were different. To test if this is significant, we do a 2-sample t-test with scipy.stats.ttest_ind():

female_viq = brain_data[brain_data['Gender'] == 'Female']['VIQ']

male_viq = brain_data[brain_data['Gender'] == 'Male']['VIQ']

stats.ttest_ind(female_viq, male_viq)  
Ttest_indResult(statistic=-0.7726161723275012, pvalue=0.44452876778583217)

Paired tests: repeated measurements on the same individuals

PIQ, VIQ, and FSIQ give 3 measures of IQ. Let us test if FSIQ and PIQ are significantly different. We can use a 2-sample test:

stats.ttest_ind(brain_data['FSIQ'], brain_data['PIQ'])   
Ttest_indResult(statistic=0.465637596380964, pvalue=0.6427725009414841)

The problem with this approach is that it forgets that there are links between observations: FSIQ and PIQ are measured on the same individuals. Thus the variance due to inter-subject variability is confounding, and can be removed, using a “paired test”, or “repeated measures test”:

stats.ttest_rel(brain_data['FSIQ'], brain_data['PIQ'])   
TtestResult(statistic=1.7842019405859857, pvalue=0.08217263818364236, df=39)

One-way ANOVA

  • Whereas t-tests compare the means of two groups/conditions, one-way analysis of variance (ANOVA) compares the means of 3 or more groups/conditions.
  • The null hypothesis tested is that there is no significant difference between the means of all the groups.
  • If the null hypothesis is rejected, ANOVA only states that there is a significant difference between the groups, but not where those differences occur.
  • To determine where the group differences are, post hoc (from the Latin post hoc, “after this”) tests are subsequently used.

\[ H_0 : \mu_1 = \mu_2 = \dots = \mu_n \]

ASSUMPTIONS

The independent ANOVA makes the same assumptions as most other parametric tests:

  • The independent variable must be categorical and the dependent variable must be continuous.
  • The groups should be independent of each other.
  • The dependent variable should be approximately normally distributed.
  • There should be no significant outliers.
  • There should be homogeneity of variance between the groups, otherwise the p-value for the F-statistic may not be reliable.

  • The first 2 assumptions are usually controlled through the use of appropriate research method design.
  • If the last three assumptions are violated then the non-parametric equivalent, Kruskal-Wallis should be considered instead.

Data: Independent ANOVA diets.csv

  • This contains a column with the 3 diets used (A, B and C) and another column with the absolute amount of weight loss after 8 weeks on one of the 3 different diets.
  • For good practice, check the descriptive statistics and the boxplots for any extreme outliers.

diet_loss_data = pd.read_csv('Independent ANOVA diets.csv')
diet_loss_data
Diet Weight loss kg
0 Diet A 3.8
1 Diet A 6.0
2 Diet A 0.7
3 Diet A 2.9
4 Diet A 2.8
... ... ...
67 Diet C 2.8
68 Diet C 4.1
69 Diet C 5.3
70 Diet C 9.2
71 Diet C 6.1

72 rows × 2 columns

sns.boxplot(data=diet_loss_data, x='Diet', y='Weight loss kg')

stats.f_oneway(diet_loss_data[diet_loss_data['Diet'] == 'Diet A']['Weight loss kg'],
               diet_loss_data[diet_loss_data['Diet'] == 'Diet B']['Weight loss kg'],
               diet_loss_data[diet_loss_data['Diet'] == 'Diet C']['Weight loss kg'])
F_onewayResult(statistic=10.825519115627964, pvalue=8.145779232433143e-05)

Results

The ANOVA result shows that the F-statistic is significant (p<.001) and that the effect size is large. Therefore, there is a significant difference between the means of the 3 diet groups.

Post hoc Analysis: tukey_hsd

  • Tukey’s honestly significant difference (HSD) test performs pairwise comparison of means for a set of samples.

  • Whereas ANOVA (e.g. f_oneway) assesses whether the true means underlying each sample are identical, Tukey’s HSD is a post hoc test used to compare the mean of each sample to the mean of each other sample.

  • The null hypothesis is that the distributions underlying the samples all have the same mean.

  • The test statistic, which is computed for every possible pairing of samples, is simply the difference between the sample means.

  • For each pair, the p-value is the probability under the null hypothesis (and other assumptions; see notes) of observing such an extreme value of the statistic, considering that many pairwise comparisons are being performed.

  • Confidence intervals for the difference between each pair of means are also available.

  • https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.tukey_hsd.html#scipy.stats.tukey_hsd

res = stats.tukey_hsd(diet_loss_data[diet_loss_data['Diet'] == 'Diet A']['Weight loss kg'],
               diet_loss_data[diet_loss_data['Diet'] == 'Diet B']['Weight loss kg'],
               diet_loss_data[diet_loss_data['Diet'] == 'Diet C']['Weight loss kg'])
print (res)
Tukey's HSD Pairwise Group Comparisons (95.0% Confidence Interval)
Comparison  Statistic  p-value  Lower CI  Upper CI
 (0 - 1)     -0.404     0.777    -1.832     1.024
 (0 - 2)     -2.579     0.000    -4.007    -1.151
 (1 - 0)      0.404     0.777    -1.024     1.832
 (1 - 2)     -2.175     0.001    -3.603    -0.747
 (2 - 0)      2.579     0.000     1.151     4.007
 (2 - 1)      2.175     0.001     0.747     3.603

Output

  • Post hoc testing shows that there is no significant difference between weight loss on diets A and B.
  • However, weight loss is significantly higher on diet C compared to diet A (p<.001) and diet B (p=.001). Cohen’s d shows that these differences have a large effect size.

Reporting

  • Independent one-way ANOVA showed a significant effect of the type of diet on weight loss after 8 weeks (F(2, 69) = 10.83, p < .001, ω² = 0.214).
  • Post hoc testing using Tukey’s correction revealed that diet C resulted in significantly greater weight loss than diet A (p<.001) or diet B (p=.001).
  • There were no significant differences in weight loss between diets A and B (p=.777)

Linear Regression

  • This approach is founded on the insight that t-tests and one-way ANOVAs are all special cases of a general linear regression model.
  • Using regression analysis can help ensure consistency in analyzing and comparing results across studies.
  • It also offers a straightforward way to control for confounding variables.
  • Regression tests the null hypothesis (Ho) that there will be no significant prediction of the dependent (outcome) variable by the predictor variable(s).
g = sns.scatterplot(data=weight_gain_data, x="Weight Before", y="Weight After")

result = stats.linregress(weight_gain_data['Weight Before'], weight_gain_data['Weight After'])
print(result.intercept, result.intercept_stderr, result.pvalue)
20.32287010489955 6.04654012186869 2.1185490670472105e-12
plt.plot(weight_gain_data['Weight Before'], weight_gain_data['Weight After'], 'o', label='original data')

plt.plot(weight_gain_data['Weight Before'], result.intercept + result.slope*weight_gain_data['Weight Before'], 'r', label='fitted line')

plt.legend()

plt.show()

Assumptions of Statistical Tests

  • All of the above-mentioned statistical tests are only valid given certain assumptions.
  • For example, ANOVA assumes independence of observations, a normal distribution of residuals, and homoscedasticity of variances.
  • There exist statistical methods that check for violations of these assumptions (one such check is sketched below).
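
For instance, the homogeneity-of-variance assumption can be checked with Levene's test (a minimal sketch on the diet data loaded above; a normality check follows in the next section):

# Levene's test: are the variances of the three diet groups roughly equal?
stat, p = stats.levene(
    diet_loss_data[diet_loss_data['Diet'] == 'Diet A']['Weight loss kg'],
    diet_loss_data[diet_loss_data['Diet'] == 'Diet B']['Weight loss kg'],
    diet_loss_data[diet_loss_data['Diet'] == 'Diet C']['Weight loss kg'])
print('Levene p-value:', p)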

Normality Test

  • Before running a parametric test, we should first check the normality of the data.

  • How do we check if the data are normal?

  • scipy.stats.normaltest

  • Test whether a sample differs from a normal distribution.

  • This function tests the null hypothesis that a sample comes from a normal distribution.

  • It is based on D’Agostino and Pearson’s test that combines skew and kurtosis to produce an omnibus test of normality.

k2, p = stats.normaltest(diet_loss_data[diet_loss_data['Diet'] == 'Diet A']['Weight loss kg'])
print ('pvalue ', p)
pvalue  0.992891828851092

Reporting

The reported p-value is higher than 0.05, so we fail to reject the null hypothesis that the data are normally distributed.

Other Tests

  • COMPARING MORE THAN TWO RELATED GROUPS - RMANOVA
  • COMPARING INDEPENDENT GROUPS AND THE EFFECTS OF COVARIATES : ANCOVA
    • ANOVA can be used to compare the means of one variable (dependent) in two or more groups,
    • whereas analysis of covariance (ANCOVA) sits between ANOVA and regression and compares the means of one (dependent) variable in two or more groups while taking into account the variability of other continuous variables (COVARIATES).
  • TWO-WAY INDEPENDENT ANOVA
    • One-way ANOVA tests situations where only one independent variable is manipulated,
    • two-way ANOVA is used when more than one independent variable has been manipulated (a short sketch follows this list).
    • In this case, the independent variables are known as factors.
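
As a minimal sketch of how a two-way independent ANOVA can be run in Python, the example below uses statsmodels on a small synthetic data set; the factor names (diet, exercise) and the data are made up purely for illustration:

# Two-way independent ANOVA with statsmodels (synthetic data, illustration only)
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'diet':     np.repeat(['A', 'B', 'C'], 20),               # factor 1
    'exercise': np.tile(np.repeat(['low', 'high'], 10), 3),   # factor 2
    'loss':     rng.normal(loc=4, scale=1.5, size=60),        # outcome
})

# Model with both main effects and their interaction; C() marks factors
model = ols('loss ~ C(diet) * C(exercise)', data=demo).fit()
print(sm.stats.anova_lm(model, typ=2))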

Bayesian Statistics

Bayesian Data Analysis with JASP

Bayesian Statistics

  1. Bayesian Binomial test
  2. Bayesian ANOVA test
  3. Bayesian Regression test
  4. ….

Bayesian T-test

BEST: Bayesian Estimation Supersedes the t-test

Which Test should I use ?

Step 1 : Type of test

  • Comparing one sample to a known or hypothesized population mean.
  • Testing relationships between two or more variables
  • Predicting outcomes
  • Testing for differences between two independent groups
  • Testing for differences between two related groups
  • Testing for differences between three or more independent groups
  • Testing for differences between three or more related groups
  • Test for interactions between 2 or more independent variables

Comparing one sample to a known or hypothesized population mean.

Task

Find the appropriate statistical test for each experiment:

  1. Experiment 1: Weight Gain

  2. Experiment 2: Response to Eye Color

  3. Experiment 3: Brain Size and Intelligence

What Happens for Large N?

  • Have you seen p-value, statistical test in the papers you have read ?

  • Any Deep learning paper ?

  • The larger the sample size, the easier it is to show differences…

  • … but for large sample sizes, we can show a statistically significant difference no matter how small it is (see the small simulation after this list)

  • A statistically significant difference does not tell us anything about whether the difference is meaningful!

  • See the concept of “informativeness”

  • For large N, differences may be statistically significant but practically meaningless!
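
A small simulation (a sketch with arbitrary parameters) illustrates the point: a tiny true difference in means (0.02, with standard deviation 1) has almost no chance of being detected with 20 samples per group, but is reliably flagged as highly significant with 500,000 per group:

# Same tiny mean difference, two very different sample sizes
rng = np.random.default_rng(42)
for n in (20, 500_000):
    a = rng.normal(loc=0.00, scale=1.0, size=n)
    b = rng.normal(loc=0.02, scale=1.0, size=n)
    print(f"n = {n:>7}: p = {stats.ttest_ind(a, b).pvalue:.3g}")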

Ethical Question

  • Dark side of statistics
  • https://en.wikipedia.org/wiki/How_to_Lie_with_Statistics
  • https://genomicsclass.github.io/book/pages/plots_to_avoid.html
  • https://www.nature.com/articles/s41467-020-19160-7

Conclusion

Thank You

References

  • https://scipy-lectures.org/packages/statistics/index.html

  • https://ethanweed.github.io/pythonbook/landingpage.html

  • https://scientistseessquirrel.wordpress.com/2018/10/02/presenting-statistical-results-pointers-in-scientific-writing/

  • https://web.archive.org/web/20091029162244/http://www.wiwi.uni-muenster.de/ioeb/en/organisation/pfaff/stat_overview_table.html

  • https://www.jmp.com/en_sg/statistics-knowledge-portal/t-test.html

  • Writing Good hypothesis for robotics experiments

  • Hypothesis Testing

Lab Class

  1. Statistics with JASP
  2. Exercise 1

Exercise: Statistical Test 1

‘Therapeutic Touch’ (TT) is a nursing practice rooted in mysticism but alleged to have a scientific basis. Practitioners of TT claim to treat medical conditions by using their hands to manipulate a ‘human energy field’ perceptible above the patients’ skin. Being a skeptical mind, 9-year old Emily Rosa ventured to test these claims by designing an experiment where TT practitioners had to correctly identify the location of the experimenter’s hand (above their left or right hand) while being blinded by a screen. The results of this experiment were later published in the prestigious Journal of the American Medical Association (Rosa et al., 1998).

    1. Open the “Emily Rosa” dataset https://osf.io/3wg6m/ (under “5. Frequencies”).
    2. Get a descriptive overview of the data by producing a frequency table of the variable Outcome. What percentage of practitioners guessed correctly?
    3. What type of test is this? Which test will you use? Explain your answer.
    4. State the null and alternative hypotheses.
    5. Conduct the test. Report: do you reject the null hypothesis?

Exercise: Statistical Test 2

A teacher believes that directed reading activities in the classroom can improve the reading ability of elementary school children. She convinces her colleagues to give her the chance to try out the new method on a random sample of 21 third-graders. After they participated for 8 weeks in the program, the children take the Degree of Reading Power test (DRP). Their scores are compared to a control group of 23 children who took the test on the same day and followed the same curriculum apart from the reading activities.

    1. Open the “Directed Reading Activities” dataset https://osf.io/3wg6m/ (under “2. T-Tests”).
    2. Get a descriptive overview of the data.
       a. Create a table that shows the means and standard deviations of DRP scores in the control and treatment group.
       b. Create a boxplot that shows the distribution of DRP scores for each group.
    3. What type of test is this? Which test will you use? Explain your answer.
    4. State the null and alternative hypotheses.
    5. Conduct the test. Report: do you reject the null hypothesis?

Exercise: Statistical Test 3

A researcher is interested in what attributes people look for in a prospective partner. Specifically, the researcher wants to know whether looks or personality play an important role in this selection process, and whether the importance of these factors differs between men and women. They decide to design a speed-dating study where 20 people – 10 men and 10 women – each have speed-dates with 9 other people, who differ in attractiveness (either attractive, average, or ugly) and charisma (either high charisma, some charisma, or no charisma). After each speed date, the participants of the study rate their preference for each date, allowing the researcher to assess the influence of attractiveness and charisma on dating preference, and whether the influence differs between male and female raters. (Note that this is a fictional data set created by Andy Field.)

    1. Open the “Looks or Personality” dataset https://osf.io/3wg6m/ (under “3. ANOVA”).
    2. Create a table that shows the means and standard deviations of preference scores across attractiveness, charisma, and gender. Is this table easy to interpret? What might be an alternative method for obtaining a descriptive overview of the data?
    3. What type of test is this? Which test will you use? Explain your answer.
    4. State the null and alternative hypotheses.
    5. Conduct the test. Report: do you reject the null hypothesis?