Applied statistics & data analysis, part 2 – Advanced
1. Audience
This course is designed for IMPRS-gBGC and iDiv PhD students, especially those in their 1st and 2nd year.
2. Date
This part will take place on November 18-21, 2013, in seminar room B0.002 at MPI-BGC. It starts at 9 a.m.
3. Aims and scope
The course aims to give an overview of the most important concepts of (some advanced) applied statistical and data-analytic methods. We will cover multivariate exploration, multivariate prediction, time series analysis, and model evaluation. Students should obtain a broad overview of the techniques currently in use, be able to “read” the results produced by the most important methods, and interpret the statistics correctly and with caution. We will present the perspectives offered by state-of-the-art methods and give orientation on where to start their own analyses. Exercises will emphasize only a few techniques that we think are most suitable in the context of Earth system sciences (and, depending on demand, ecology). The topics planned for each day are listed under Structure below.
Structure
Every day will start with a lecture on a specific topic. We will have exercises either between the lectures or in the early afternoon.
After the breaks, we will have 2 short presentations by the students (see below). So far we are planning the following topics for the days:
- DAY 1: Introduction to concepts of (linear/nonlinear) multivariate data explorations
  - multivariate visualizations
  - multivariate correlations
  - dimensionality reduction
- DAY 2: Introduction to concepts of multivariate (nonlinear) predictions
  - Regression trees + cross validation
  - ANNs
- DAY 3: Introduction to time series analysis
  - Fourier
  - Wavelets
  - SSA
  - … multivariate cases?
- DAY 4: Introduction to model evaluation
  - concepts
  - metrics
  - caveats
  - incl. links to all previously mentioned techniques
- Introduction to the Julia language
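Day 2 pairs a prediction method with cross validation. As a rough orientation, the following is a minimal sketch of k-fold cross validation. It is written in Python purely for illustration (the course exercises use R), it swaps the regression tree for a straight-line least-squares fit to stay short, and the function names `k_fold_indices`, `fit_line`, and `cross_validated_rmse` are hypothetical and not part of the course material.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def cross_validated_rmse(xs, ys, k=5, seed=0):
    """Fit on k-1 folds, predict the held-out fold, and pool the errors."""
    squared_errors = []
    for test in k_fold_indices(len(xs), k, seed):
        held_out = set(test)
        train = [i for i in range(len(xs)) if i not in held_out]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # Errors are computed only on points the model has not seen.
        squared_errors.extend((ys[i] - (a + b * xs[i])) ** 2 for i in test)
    return (sum(squared_errors) / len(squared_errors)) ** 0.5
```

The key point is that each observation is predicted by a model that was fitted without it, so the pooled error estimates out-of-sample performance rather than goodness of fit.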
Prerequisites
- Either the course “Applied statistics & data analysis: Part 1” or a recollection of the typical “statistics 1” lectures from university.
- Basic knowledge of a language for scientific computing: R, Matlab, or Julia (julialang.org)
- Exercises will be in R. The use of any other language is welcome; however, support depends on the person in charge and cannot be guaranteed.
Lecturers
- Miguel D. Mahecha
- Martin Jung
- Fabian Gans
- Additional people might contribute to specific subjects.
7. Requirements for the assignment
All students have to prepare a very short presentation (up to 5 slides) on one unconventional method. Every day will have 3-5 of these presentations. Please register for one of the following topics (but feel free to add another one).
Important
- Don’t choose a technique that you know already!
- Check the list of participants below and choose a topic that has not yet been selected. Ideally, we would like to cover all topics.
Use the reference as a starting point … and note that we are not necessarily experts in the methods.
# | Topic | Starting reference | Context | Difficulty (1-3) |
1 | Misuses of statistical analysis in climate research | von Storch, H. (1995) Misuses of statistical analysis in climate research. In H. von Storch and A. Navarra (eds.): Analysis of Climate Variability: Applications of Statistical Techniques. Springer Verlag, 11-26 | General | 1 |
2 | Variance partitioning | Chevan, A. & Sutherland, M. (1991) Hierarchical partitioning. The American Statistician, 45, 90–96. | General | 1 |
3 | Small n, large p | Schäfer, J., and K. Strimmer (2005) A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statist. Appl. Genet. Mol. Biol. 4, 32. | General | 3 |
4 | Boosted regression trees | Elith et al. (2008) A working guide to boosted regression trees. Journal of Animal Ecology 77, 802-813. | Prediction | 2 |
5 | Feature selection | Saeys et al. (2007) A review of feature selection techniques in bioinformatics. Bioinformatics 23, 2507-2517. | Prediction | 2 |
6 | Visibility graphs for time series | Lacasa et al. (2008) PNAS 105, 4972-4975 | Time series | 1 |
7 | Visibility graphs for spatial data | de Berg, Mark; van Kreveld, Marc; Overmars, Mark; Schwarzkopf, Otfried (2000), "Chapter 15: Visibility Graph", Computational Geometry (2nd ed.), Springer-Verlag, pp. 307–317 | Explorative | ? |
8 | Clustering by passing messages between data points | Frey and Dueck (2007) Science 315, 972-976; Mézard (2007) Science 315, 949-951 | Classification | 2 |
9 | Detecting large spatiotemporal extreme events | Lloyd-Hughes, B., (2012) A spatiotemporal structure-based approach to drought characterization. International Journal of Climatology 32, 406–41. Zscheischler et al. (2013) Ecological Informatics 15, 66-73. | Spatiotemporal exploration | 1 |
10 | Climate networks: construction | Tsonis, A.A. and Roebber, P.J. (2004) The architecture of the climate network. Physica A 333, 497-504. | Spatiotemporal exploration | 1 |
11 | Climate networks: metrics | To be discussed by email* | Spatiotemporal exploration | 1 |
12 | Graph-valued regression | Liu et al. | Prediction | 3 |
13 | Empirical Mode decomposition (for time series) | Huang, N. E. & Wu, Z. (2008) A review on Hilbert-Huang transform: method and its applications to geophysical studies. Reviews of Geophysics 46, RG2006 | Time series | 2-3 |
14 | Empirical Mode decomposition (for spatial data) | To be discussed by email* | Spatial exploration | 3 |
15 | Recurrence plots | To be discussed by email* | Time series | 2-3 |
16 | Recurrence plot metrics (RQA) | To be discussed by email* | Time series | 2-3 |
17 | Nonlinear PCA (via ANNs) | Hsieh, W.W. (2001) Nonlinear principal component analysis by neural networks. Tellus 53A, 599-615; Hsieh, W.W. (2009) Machine Learning in the Environmental Sciences. Cambridge University Press (there is also a tutorial on the web, which we could not locate quickly) | Spatiotemporal exploration | 2 |
18 | What is long-range memory in time series? | To be discussed by email* | Time series | 2 |
19 | What are surrogate data? | Venema et al. (2006) Nonlinear Processes in Geophysics 13, 449-466. http://www2.meteo.uni-bonn.de/mitarbeiter/venema/themes/surrogates/iaaft/iaaft_articles.html | Time series | 3 |
*) please email Miguel Mahecha.
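To give a flavour of how compact some of these methods can be, here is a minimal sketch of the natural visibility graph for time series (topic 6). It follows the visibility criterion of Lacasa et al.: two observations are linked if every observation between them lies strictly below the straight line connecting them. The sketch is in Python for illustration only (the course exercises use R), it assumes equally spaced observations so that the index serves as time, and the function name `natural_visibility_graph` is hypothetical.

```python
def natural_visibility_graph(series):
    """Return the edges (i, j), i < j, of the natural visibility graph.

    Assumes equally spaced observations, so the list index is used as time.
    """
    n = len(series)
    edges = set()
    for i in range(n - 1):
        for j in range(i + 1, n):
            # Points i and j "see" each other if every intermediate point k
            # lies strictly below the line joining (i, y_i) and (j, y_j).
            visible = all(
                series[k] < series[j] + (series[i] - series[j]) * (j - k) / (j - i)
                for k in range(i + 1, j)
            )
            if visible:
                edges.add((i, j))
    return edges
```

For example, `sorted(natural_visibility_graph([3, 1, 2]))` gives `[(0, 1), (0, 2), (1, 2)]`, whereas for `[1, 3, 1]` the peak in the middle blocks the view between the first and last point. This brute-force version is cubic in the series length in the worst case and is meant for short series only.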
8. What else you need to prepare
Bring a laptop and make sure that a recent version of R is running on it.
You can download the most recent version here: http://www.r-project.org/.
Also install RStudio, an open-source integrated development environment that runs on all platforms. It nicely combines console, script editor, working directory, plots, etc. into an uncluttered layout that you can easily navigate. You need to have R installed before you can use RStudio as a development environment.
Please also make sure that you can access the internet via WLAN (BGC-users, if you have a BGC-account; BGC-guests, if you don't have an account).
Feel free to also install Julia (julialang.org, version 0.2) and juliastudio. Alternatively, Julia will be available through a web browser.