Applied statistics & data analysis
 

1.  Basic statistics

1.1  Organizational issues

Date: September 9-11, 2015
Place: Seminar room B0.002 @ MPI-BGC
Planned sessions:

  • 09:00 - 10:30
  • 10:45 - 12:15
  • 13:15 - 14:45
  • 15:00 - 16:30

Instructor: Jens Schumacher

 

1.2  Aims and scope

The course starts with an overview of the "standard statistical toolbox", reviewing basic statistical approaches such as correlation, linear regression and analysis of variance. Special emphasis is put on testing assumptions and on statistical model selection. This naturally leads to situations where the standard assumptions are not fulfilled but the same types of questions still need to be answered.

The aim of the course is to introduce basic statistical thinking and to enable you to look at your data statistically. Each block is accompanied by practicals in which example data are analyzed using the software package R.

 



Learn R… Here is a list of useful online resources to help you bring your R skills to a new level.
The material from the R basics course might also be useful for you.

 

1.3  Interested?

Prerequisites: Basic knowledge of a language for scientific computing such as R or Matlab (exercises will be in R).
This course is designed for IMPRS-gBGC doctoral researchers, especially in the 1st and 2nd year. The course can be a 'stand-alone' (separate certificate) or a preparation for the module 'Advanced statistics and data analysis'. Register here.

 

1.4  What you need to prepare

Bring a laptop and make sure that a recent version of R is running on it.
You can download the most recent version here: http://www.r-project.org/.
You might also like RStudio, an open-source integrated development environment that runs on all platforms. It combines console, script editor, working directory, plots etc. in an uncluttered layout that you can easily navigate. R must be installed before you can use RStudio as a development environment.

Please also make sure that you can access the internet via WLAN (BGC-users, if you have a BGC-account; BGC-guests, if you don't have an account).

 

1.5  Preliminary agenda

DAY 1: Introduction to basic statistical tools:

  • correlation,
  • linear regression,
  • analysis of variance,
  • model selection
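
As a rough illustration of the first two tools on this list, the following minimal sketch computes a Pearson correlation and an ordinary least-squares line, then looks at the residuals as a first check of model assumptions. It is written in Python only to keep the example self-contained; the course practicals use R. The data values and the function names (`pearson_r`, `fit_line`) are made up for this example and are not taken from the course material.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope


# Made-up illustrative data (hypothetical, not course data)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(x, y)
intercept, slope = fit_line(x, y)

# Residuals: observed minus fitted values. With an intercept in the
# model they sum to zero; their pattern is what one inspects when
# checking assumptions such as constant variance.
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]

print(f"r = {r:.3f}, intercept = {intercept:.3f}, slope = {slope:.3f}")
```

Plotting the residuals against the fitted values is a common next step; a funnel shape would hint at the variance heterogeneity addressed on Day 2.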

 

 

DAY 2: What can be done if standard assumptions are not satisfied?

  • dealing with variance heterogeneity,
  • spatial and temporal autocorrelation

 

 

DAY 3: Introduction to linear mixed models; basic ideas of nonparametric methods of curve estimation

 

 

1.6  Course material

Slides

Handout
Practical - Linear regression
Reference card

Example data

ZIP file containing example data
R scripts

 

1.7  Feedback

23 out of 26 participants filled in the survey by September 24, 2015. Thanks a lot for taking the time! Your feedback is valuable because it helps the instructors and organizers to improve the individual modules and the general structure of the course.
The survey results are available here. Statistics and statements should not be taken as an exhaustive or exclusive list.

 

2.  Advanced statistics and data analysis

2.1  Organizational issues

Date: September 14-17, 2015
Place: Seminar room B0.002 @ MPI-BGC
Starting time: 9 a.m.
Instructors:

 

2.2  Aims and scope

The course aims to give an overview of concepts in (some advanced) applied statistical and data-analytic methods. We will cover multivariate exploration, multivariate prediction, time series analysis, and model evaluation. Doctoral candidates should obtain a broad overview of currently used techniques, be able to “read” the results produced by the most important methods, and interpret the statistics correctly and with caution. We will show the perspectives offered by state-of-the-art methods and give orientation on where to start your own analyses. Exercises will emphasize only a few techniques that we think are most suitable in the context of Earth system sciences (and, depending on demand, ecology). The topics are listed by day below.

Structure

Every day will contain at least one lecture on a specific topic, complemented with exercises. In addition, each participant prepares a presentation on a specific topic and acts as the “expert” for that method during the course.

After the breaks, we will have two short presentations by the participants (see below). So far we are planning the following topics for the days:

 

DAY 1: Concepts of (linear/nonlinear) multivariate data explorations

  • multivariate visualizations
  • dimensionality reduction

 

 

DAY 2: Concepts of multivariate (nonlinear) predictions

  • Regression trees + cross validation
  • ANNs

 

 

DAY 3: Time series analysis

  • Fourier
  • Wavelets
  • SSA
  • … multivariate cases?

 

 

DAY 4: Model evaluation

  • concepts
  • metrics
  • caveats
  • incl. links to all previously mentioned techniques

 

 

2.3  Interested?

Prerequisites:

  • Basic knowledge of a language for scientific computing such as R or Matlab
  • Make use of the R course - The basics
  • Either the course 'Basic statistics' or a recollection of the typical “statistics 1” type of lectures from university.

Exercises will be in R. Other languages are welcome; however, support depends on the person in charge and cannot be guaranteed.

The course can be a 'stand-alone' (separate certificate) if you have a solid background in basic statistics. You can brush up your skills with the course 'Basic statistics'.

Register here.

 

2.4  What else you need to prepare

Bring a laptop and make sure that a recent version of R is running on it.

You can download the most recent version here: http://www.r-project.org/.
Also install RStudio, an open-source integrated development environment that runs on all platforms. It combines console, script editor, working directory, plots etc. in an uncluttered layout that you can easily navigate. R must be installed before you can use RStudio as a development environment.

Please also make sure that you can access the internet via WLAN (BGC-users, if you have a BGC-account; BGC-guests, if you don't have an account).

2.5  Requirements for the assignment

Every participant has to prepare a short presentation on one "unconventional" method of their choice. Each day will include a few of these presentations, and we want to discuss the pros and cons with you. Please register for one of the following topics (but feel free to add another one).

Important

 
  • Don’t choose a technique that you know already!
  • Check the list of participants below and choose a topic that has not yet been selected. Ideally, we would like to cover all topics.

Use the reference as a starting point … and note that we are not necessarily experts in the methods.

 
# | Topic | Starting reference | Context | Difficulty (1-3)
1 | Misuses of statistical analysis in climate research | von Storch, H., 1995: Misuses of statistical analysis in climate research. In H. von Storch and A. Navarra (eds.): Analysis of Climate Variability: Applications of Statistical Techniques. Springer Verlag, 11-26 | General | 1
2 | Model validation and verification | Environmental perspective: Oreskes N, Shrader-Frechette K & Belitz K (1994) Verification, validation, and confirmation of numerical models in the earth sciences. Science, AAAS, 263, 641; Information science perspective: Sargent R (2005) Verification and validation of simulation models, 130-143; Philosophical perspective: Kleindorfer G & Ganeshan R (1993) The philosophy of science and validation in simulation, 50-57 | General | 1
3 | Variance partitioning | Chevan, A. & Sutherland, M. (1991) Hierarchical partitioning. The American Statistician, 45, 90–96. | General | 1
4 | Small n, large p | Schäfer, J., and K. Strimmer (2005) A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statist. Appl. Genet. Mol. Biol. 4, 32; http://strimmerlab.org/software/corpcor/ | General | 3
5 | Boosted regression trees | Elith et al. (2008) A working guide to boosted regression trees. J of Animal Ecology 77, 802-813. | Prediction | 2
6 | Feature selection | Saeys et al. (2007) A review of feature selection techniques in bioinformatics. Bioinformatics 23, 2507-2517. | Prediction | 2
7 | Visibility graphs for time series | Lacasa et al. PNAS 105, 4972-4975 | Time series | 1
8 | Visibility graphs for spatial data | de Berg, Mark; van Kreveld, Marc; Overmars, Mark; Schwarzkopf, Otfried (2000), "Chapter 15: Visibility Graph", Computational Geometry (2nd ed.), Springer-Verlag, pp. 307–317 | Explorative | ?
9 | Clustering by passing messages between data points | Frey and Dueck (2007) Science 315, 972-976; Mézard (2007) Science 315, 949-951 | Classification | 2
10 | Detecting large spatiotemporal extreme events | Lloyd-Hughes, B. (2012) A spatiotemporal structure-based approach to drought characterization. International Journal of Climatology 32, 406–41; Zscheischler et al. (2013) Ecological Informatics 15, 66-73. | Spatiotemporal exploration | 1
11 | Climate networks: construction | Tsonis, A.A. and Roebber, P.J. The architecture of the climate network. Physica A 333, 497-504. | Spatiotemporal exploration | 1
12 | Climate networks: metrics | To be discussed by email* | Spatiotemporal exploration | 1
13 | Graph-valued regression | Liu et al.; http://books.nips.cc/papers/files/nips23/NIPS2010_0455.pdf | Prediction | 3
14 | Recurrence plots | To be discussed by email*; http://www.recurrence-plot.tk/glance.php | Time series | 2-3
15 | Recurrence plot metrics (RQA) | To be discussed by email*; http://www.recurrence-plot.tk/glance.php | Time series | 2-3
16 | Nonlinear PCA (via ANNs) | Hsieh, W.W., 2001. Nonlinear principal component analysis by neural networks. Tellus 53A: 599-615; Hsieh, W.W. (2009) Machine Learning in the Environmental Sciences. Cambridge University Press (there is also a tutorial on the web, which we could not find quickly) | Spatiotemporal exploration | 2
17 | What is long-range memory in time series? | To be discussed by email* | Time series | 2
18 | Model calibration | van Oijen M, Rougier J & Smith R (2005) Bayesian calibration of process-based forest models: bridging the gap between models and data. Tree Physiol, 25, 915-927; Omlin M & Reichert P (1999) A comparison of techniques for the estimation of model prediction uncertainty. Ecological modelling, Elsevier, 115, 45-59 | Time series | 2-3
19 | What are surrogate data? | Venema et al. (2006) Nonlinear Processes in Geophysics 13, 449-466; http://www2.meteo.uni-bonn.de/mitarbeiter/venema/themes/surrogates/iaaft/iaaft_articles.html | Time series | 3
 

*) Please email Miguel Mahecha.

 

2.6  Feedback

10 out of 12 participants filled in the survey by September 29, 2015. Thanks a lot for taking the time! Your feedback is valuable because it helps the instructors and organizers to improve the individual modules and the general structure of the course.
The survey results are available here. Statistics and statements should not be taken as an exhaustive or exclusive list.
