IMPRS-gBGC course 'Applied statistics & data analysis' 2018
Category: Skill course
0.2 CP per course day
1. Basic statistics
1.1 Organizational issues
Date: September 11–13, 2017
Place: Seminar room B0.002 @ MPI-BGC
Planned sessions:
 09:00 – 10:30
 10:45 – 12:15
 13:15 – 14:45
 15:00 – 16:30
Instructor: Jens Schumacher
1.2 Aims and scope
The course will start with an overview of the "standard statistical toolbox", reviewing basic statistical approaches such as correlation, linear regression and analysis of variance. Special emphasis will be put on testing assumptions and on statistical model selection. This will naturally lead us to situations where standard assumptions are not fulfilled but the same types of questions still need to be answered.
Learn R… Here is a list of useful online resources to help you bring your R skills to a new level.
The material from the R basics course might also be useful for you.
The aim of the course is to introduce basic statistical thinking and to enable you to look at your data statistically. Each block will be accompanied by practicals in which example data are analyzed using the software package R.
1.3 Interested?
Prerequisites: Basic knowledge of a language of scientific computing: R, Matlab (exercises will be in R)
The course can be a 'standalone' or a preparation for the module 'Advanced statistics and data analysis'. Register here by August 23.
1.4 What you need to prepare
Bring a laptop and make sure that a recent version of R is running on it.
You can download the most recent version here: http://www.r-project.org/.
You might like RStudio, an open-source integrated development environment that runs on all platforms. It nicely combines console, script editor, working directory, plots etc. into an uncluttered layout that you can easily navigate. You need to have R installed before you can use RStudio as a development environment.
Please also make sure that you can access the internet via WLAN (BGCusers, if you have a BGC account; eduroam or BGCguests, if you don't have an account).
1.5 Preliminary agenda
Day  Topic
Monday, Sept 11  Introduction to basic statistical tools
Tuesday, Sept 12  What can be done if standard assumptions are not satisfied?
Wednesday, Sept 13  Introduction to linear mixed models; basic ideas of nonparametric methods of curve estimation
1.6 Material
Handouts by Jens Schumacher
Data by Jens Schumacher
Material for the practicals by Jens Schumacher
1.7 Feedback
The survey results are available here. Statistics and statements should not be taken as an exhaustive or exclusive list.
2. Advanced statistics and machine learning for data analysis
2.1 Organizational issues
Date: January 22–24 and 26, 2018
Place: Seminar room B0.002 @ MPI-BGC
Starting time: 9:30 a.m.
Instructors:
 Fabian Gans
 Paul Bodesheim
 Guido Kraemer
 Thomas Wutzler
 Additional people might contribute to specific subjects.
2.2 Aims and scope
The course aims at giving an overview of concepts of (some advanced) applied statistics and machine learning methods for data analysis. We will cover topics such as multivariate explorations, dimensionality reduction, data visualization, multivariate predictions, and time series analysis. Doctoral candidates should obtain a broad overview of the currently used techniques; they must be able to “read” results produced by the most important methods and interpret the statistics correctly as well as with caution. We will provide the participants with perspectives offered by state-of-the-art methods and give guidance on where to start their own analyses. Exercises will emphasize some techniques that we think are most suitable in the context of Earth system sciences (and, depending on the demand, ecology).
Structure
Every day will contain at least one lecture on a specific topic – complemented with exercises. In addition, each participant prepares a presentation on a specific topic and acts as “expert” for this method during the course.
After the breaks, we will have 2 short presentations by the participants (see below). So far we are planning the following topics for the days:
Day  Topic  Instructor(s)
Mon, Jan 22  Concepts of (linear/nonlinear) multivariate data explorations  Guido Kraemer
Tue, Jan 23  Concepts of multivariate nonlinear predictions I  Paul Bodesheim
Wed, Jan 24  Concepts of multivariate nonlinear predictions II  Paul Bodesheim
Fri, Jan 26  Time series analysis  Fabian Gans
2.3 Interested?
Prerequisites:
 Basic knowledge of a language of scientific computing: R, Matlab
 Make use of the R course – The basics
 Either the course 'Basic statistics' or a good recollection of the typical “statistics 1” lectures from university.
Exercises will be in R – the use of any other language is welcome; however support depends on the person in charge and cannot be guaranteed.
The course can be a 'standalone' (separate certificate) if you have a solid background in basic statistics. You can brush up your skills with the course 'Basic statistics'. Register here by December 5.
2.4 What else you need to prepare
Bring a laptop and make sure that a recent version of R is running on it.
You can download the most recent version here: http://www.r-project.org/.
Also install RStudio, an open-source integrated development environment that runs on all platforms. It nicely combines console, script editor, working directory, plots etc. into an uncluttered layout that you can easily navigate. You need to have R installed before you can use RStudio as a development environment.
Please also make sure that you can access the internet via WLAN (BGCusers, if you have a BGC account; BGCguests, if you don't have an account).
2.5 Requirements for the assignment
All participants have to prepare a short presentation on one "unconventional" method of their choice. Every day will have a few of these presentations, and we want to discuss the pros and cons with you. Please register for one of the following topics (but feel free to add another one).
Important
 Don’t choose a technique that you know already!
 Check the list of participants below and choose a topic that has not yet been selected. Ideally, we would like to cover all topics.
Use the reference as a starting point … and note that we are not necessarily experts in the methods.
#  Topic  Starting reference  Context  Difficulty (1–3)
1 (Fabian)  Misuses of statistical analysis in climate research  von Storch, H., 1995: Misuses of statistical analysis in climate research. In H. von Storch and A. Navarra (eds.): Analysis of Climate Variability: Applications of Statistical Techniques. Springer Verlag, 11–26  General  1
2 (Thomas)  Model validation and verification: perspectives  Environmental perspective: Oreskes N, Shrader-Frechette K & Belitz K (1994) Verification, validation, and confirmation of numerical models in the earth sciences. Science, AAAS, 263, 641. Information science perspective: Sargent R (2005) Verification and validation of simulation models, pp. 130–143. Philosophical perspective: Kleindorfer G & Ganeshan R (1993) The philosophy of science and validation in simulation, pp. 50–57  General  1
3 (Thomas)  Model validation: metrics  Janssen P & Heuberger P (1995) Calibration of process-oriented models. Ecological Modelling, Elsevier, 83, 55–66, doi:10.1016/0304-3800(95)00084-9. Taylor plot: Taylor K (2001) Summarizing multiple aspects of model performance in a single diagram. Journal of Geophysical Research, Wiley-Blackwell, 106, 7183. Kling-Gupta efficiency: Gupta H, Kling H, Yilmaz K & Martinez G (2009) Decomposition of the mean squared error and NSE performance criteria: Implications for improving hydrological modelling. Journal of Hydrology, Elsevier BV, 377, 80–91  General  2
4 (Guido)  Small n, large p  Schäfer, J., and K. Strimmer (2005) A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statist. Appl. Genet. Mol. Biol. 4, 32.  General  3
5 (Paul)  Boosted regression trees  Elith et al. (2008) A working guide to boosted regression trees. J of Animal Ecology 77, 802–813.  Prediction  2
6 (Paul)  Feature selection  Saeys et al. (2007) A review of feature selection techniques in bioinformatics. Bioinformatics 23, 2507–2517.  Prediction  2
7 (Fabian)  Visibility graphs for time series  Lacasa et al. PNAS 105, 4972–4975  Time series  1
8 (Fabian)  Visibility graphs for spatial data  de Berg, Mark; van Kreveld, Marc; Overmars, Mark; Schwarzkopf, Otfried (2000), "Chapter 15: Visibility Graph", Computational Geometry (2nd ed.), Springer-Verlag, pp. 307–317  Explorative  ?
9 (Paul)  Clustering with k-means and Gaussian mixture models  Chapter 9 of Christopher M. Bishop: Pattern Recognition and Machine Learning. Springer 2006.  Classification  1
10 (Fabian)  Detecting large spatiotemporal extreme events  Lloyd-Hughes, B. (2012) A spatio-temporal structure-based approach to drought characterization. International Journal of Climatology 32, 406–41. Zscheischler et al. (2013) Ecological Informatics 15, 66–73.  Spatiotemporal exploration  1
11 (Paul)  Probability distributions and density estimation  Chapter 2 of Christopher M. Bishop: Pattern Recognition and Machine Learning. Springer 2006.  General  1 
12 (Paul)  Large-scale nearest neighbor search  Muja, M. & Lowe, D. G.: Scalable Nearest Neighbor Algorithms for High Dimensional Data. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2014, 36, pages 2227–2240  General / Prediction  1
13 (Paul)  Anomaly detection with isolation forest  Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou: Isolation Forest. International Conference on Data Mining (ICDM) 2008, pages 413–422  General / Explorative  2
14 (Fabian)  Recurrence plots  To be discussed by email*  Time series  2–3
15 (Fabian)  Recurrence plot metrics (RQA)  To be discussed by email*  Time series  2–3
16 (Guido)  Autoencoder  Hsieh, W.W., 2001. Nonlinear principal component analysis by neural networks. Tellus 53A: 599–615. Chapter 2 of Gorbanʹ, A.N. (Ed.), 2008. Principal manifolds for data visualization and dimension reduction, Lecture notes in computational science and engineering. Springer, Berlin; New York. Section 2 of Hsieh, W.W., 2004. Nonlinear multivariate and time series analysis by neural network methods. Rev. Geophys. 42, RG1003. doi:10.1029/2002RG000112  Dimensionality Reduction  2
17 (Fabian)  What is long-range memory in time series?  To be discussed by email*  Time series  2
18 (Thomas)  Model calibration  van Oijen M, Rougier J & Smith R (2005) Bayesian calibration of process-based forest models: bridging the gap between models and data. Tree Physiol, 25, 915–927. Omlin M & Reichert P (1999) A comparison of techniques for the estimation of model prediction uncertainty. Ecological Modelling, Elsevier, 115, 45–59  Time series  2–3
19 (Fabian)  What are surrogate data?  Venema et al. (2006) Nonlinear Processes in Geophysics 13, 449–466. http://www2.meteo.uni-bonn.de/mitarbeiter/venema/themes/surrogates/iaaft/iaaft_articles.html  Time series  3
20 (Thomas)  How can I use bootstrapping?  Efron B & Tibshirani R (1986) Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science, Institute of Mathematical Statistics, 54–75  Time series  2
21 (Guido)  Multivariate Indicator Approaches  Wolter, K., Timlin, M., 1993. Monitoring ENSO in COADS with a seasonally adjusted principal component index. NOAA/NMC/CAC, NSSL, Oklahoma Clim. Survey, CIMMS and the School of Meteor., Univ. of Oklahoma, Norman, OK. Wolter, K., Timlin, M.S., 2011. El Niño/Southern Oscillation behaviour since 1871 as diagnosed in an extended multivariate ENSO index (MEI.ext). Int. J. Climatol. 31, 1074–1087. doi:10.1002/joc.2336 https://www.esrl.noaa.gov/psd/enso/mei/ https://www.esrl.noaa.gov/psd/enso/mei.ext/index.html  Dimensionality Reduction  2 
22 (Guido)  t-SNE  van der Maaten, L., Hinton, G., 2008. Visualizing Data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605. https://lvdmaaten.github.io/tsne/  Dimensionality Reduction  2
2.6 Feedback
The survey results are available here. Statistics and statements should not be taken as an exhaustive or exclusive list.