IMPRS-gBGC course 'Applied statistics & data analysis' 2020, Advanced
 

Category: Skill course
0.2 per course day
 

1.  Advanced statistics

1.1  Organizational issues

Date: November 16 - 20, 2020
Place: lecture room @ MPI-BGC (depending on COVID-19 regulations)
Planned sessions:

 
  • 09:00 - 09:45 lecture
  • 09:45 - 10:00 break
  • 10:00 - 11:00 talks
  • 11:00 - 11:15 break
  • 11:15 - 12:00 excursion
  • 12:00 - 13:00 lunch
  • 13:00 - 14:00 talks
  • 14:00 - 14:15 break
  • 14:15 - 15:00 lecture
  • 15:00 - 17:00 practical part

Instructor:

 

1.2  Aims and scope

The course will cover selected topics of advanced statistics and machine learning. Lectures on some topics will be accompanied by presentations by participants, “Excursion” talks on applications in research, and basic practicals in the afternoon. The course requires basic knowledge of statistics. The practical sessions require basic knowledge of a programming language; examples will be provided in R.

 

1.3  Presentations by participants (mandatory for assignment)

Participants will give a presentation (20min + 10min Q&A) on a paper or topic of their choice. Below you can find a list of suggested papers. If you want to work on a topic in a team of 2 (i.e. 40min + 20min Q&A) or suggest an alternative topic, please send your proposed topic to mjung@bgc-jena.mpg.de by 31st October.

During registration, please choose a topic that has not yet been chosen.

All presentations need to be ready on Monday 16th Nov 2020 at 9 am. The detailed schedule will be announced then.

The presentations should be educational and focus on the important things one should know about a method when applying it, i.e. the principle, advantages, disadvantages, assumptions, and pitfalls, rather than all mathematical details, derivations, theorems and proofs. Practical examples are often very illustrative.

 

1.4  Other Preparations

Bring a laptop with a recent version of R installed and running for the practicals. If you prefer another language, that is fine, but we will not provide corresponding code examples. Please also make sure that you can access the internet via WLAN (BGC-users, if you have a BGC-account; BGC-guests, if you don't have an account).

 

1.5  Preliminary agenda

Day and time | Topic | Who

Monday, November 16
09:00 - 09:45 | Introduction to basic statistical tools | Martin Jung
09:45 - 10:00 | Break
10:00 - 10:30 | Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy
10:30 - 11:00 | Toward the true near‐surface wind speed: Error modeling and calibration using triple collocation
11:00 - 11:15 | Break
11:15 - 12:00 | Hierarchical Clustering via Joint Between-Within Distances: Extending Ward's Minimum Variance Method
12:00 - 13:00 | Lunch Break
13:00 - 13:30 | Archetypal Analysis
13:30 - 14:00 | Visualizing Data using t-SNE
14:00 - 14:15 | Break
14:15 - 15:00 | Dimensionality reduction | Mirco Migliavacca
15:00 - 17:00 | Practical | Mirco Migliavacca

Tuesday, November 17
09:00 - 09:45 | Time series analysis | Lina Estupinan-Suarez
09:45 - 10:00 | Break
10:00 - 10:30 | Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure
10:30 - 11:00 | Summarizing multiple aspects of model performance in a single diagram
11:00 - 11:15 | Break
11:15 - 12:00 | EXCURSION | Nora Linscheid
12:00 - 13:00 | Lunch Break
13:00 - 13:30 | BGI SEMINAR
13:30 - 14:00 | BGI SEMINAR
14:00 - 14:15 | Break
14:15 - 15:00 | Mixed effect model | Thomas Wutzler
15:00 - 17:00 | Practical | Thomas Wutzler

Wednesday, November 18
09:00 - 09:45 | Random Forests | Martin Jung
09:45 - 10:00 | Break
10:00 - 10:30 | Bias in random forest variable importance measures: Illustrations, sources and a solution
10:30 - 11:00 | Isolation Forest
11:00 - 11:15 | Break
11:15 - 12:00 | EXCURSION | Jacob Nelson
12:00 - 13:00 | Lunch Break
13:00 - 13:30 | A working guide to boosted regression trees
13:30 - 14:00 | A unified approach to interpreting model predictions
14:00 - 14:15 | Break
14:15 - 15:00 | Model evaluation | Martin Jung
15:00 - 17:00 | Practical | Simon Bessnard

Thursday, November 19
09:00 - 09:45 | Neural Networks | Basil Kraft
09:45 - 10:00 | Break
10:00 - 10:30 | Deep learning
10:30 - 11:00 | Long Short-Term Memory
11:00 - 11:15 | Break
11:15 - 12:00 | EXCURSION | Basil Kraft
12:00 - 13:00 | Lunch Break
13:00 - 13:30 | Variable Importance | Martin Jung
13:30 - 15:00 | Practical | Fabian Ganz

Friday, November 20
09:00 - 09:45 | Parameter estimation | Nuno Carvalhais
09:45 - 10:00 | Break
10:00 - 10:30 | Decomposition of the mean squared error and NSE performance criteria: Implications for improving hydrological modelling
10:30 - 11:00 | A comparison of techniques for the estimation of model prediction uncertainty
11:00 - 11:15 | Break
11:15 - 12:00 | EXCURSION | Tina Trautmann
12:00 - 13:00 | Lunch Break
13:00 - 13:30 | Deep learning and process understanding for data-driven Earth system science
13:30 - 14:00 | Feedback
 
 

1.6  Interested?

Prerequisites:

  • Basic knowledge of a language of scientific computing: R, Matlab
  • Make use of the R course - The basics
  • Either the course 'Basic statistics' or recalling the typical “statistics 1” type of lectures from university.

Exercises will be in R – the use of any other language is welcome; however support depends on the person in charge and cannot be guaranteed.
 

 



Learn R… Here is a list of useful online resources to help you bring your R skills to a new level.
The material from the R basics course might also be useful for you.

 

1.7  Material

Here, you can download the papers, which you will need for your presentation.

 

1.8  Requirements for the assignment

All participants have to prepare a short presentation on one "unconventional" method of their choice. Every day will have a few of these presentations, and we want to discuss the pros and cons with you. Please register for one of the following topics (but feel free to add another one).

Important

 
  • Don’t choose a technique that you know already!
  • Check the list of participants below and choose a topic that has not yet been selected. Ideally, we would like to cover all topics.

... and note that we are not necessarily experts in the methods.

 
# | Name of presenter | Topic | Context
1 | SOPHIA WALTER | Archetypal Analysis | Multivariate data representation
2 | ANN-SOPHIE LEHNERT | A working guide to boosted regression trees | non parametric regression
3 | | From outliers to prototypes: Ordering data | novelty/outlier detection
4 | SANTIAGO BOTIA | Long Short-Term Memory | neural networks for time series
5 | | Calibration of process-oriented models | model calibration and evaluation
6 | SOPHIE VON FROMM | Deep learning | deep learning overview
7 | CAGLAR KUCUK | A unified approach to interpreting model predictions | variable importance, explainable AI
8 | | Quantile regression forests | random forest, quantile regression
9 | | Deep learning and process understanding for data-driven Earth system science | deep learning and hybrid modeling for Earth System Science
10 | SINIKKA PAULUS | Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure | model evaluation
11 | | MissForest—non-parametric missing value imputation for mixed-type data | random forests, data imputation (filling missing data)
12 | WEIJIE ZHANG | Bias in random forest variable importance measures: Illustrations, sources and a solution | random forest, variable importance
13 | | Measuring and Testing Dependence by Correlation of Distances | non-linear correlation
14 | ULISSE GOMARASCA | Visualizing Data using t-SNE | dimensionality reduction, multivariate data visualization
15 | | The energy of data | non-parametric statistics based on distances
16 | QIAN ZHANG | Summarizing multiple aspects of model performance in a single diagram | model evaluation
17 | WANTONG LI | Decomposition of the mean squared error and NSE performance criteria: Implications for improving hydrological modelling | model evaluation and calibration
18 | HOONTAEK LEE | Isolation Forest | random forest, novelty/outlier detection
19 | YUNPENG LUO | Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy | uncertainty
20 | SIYUAN WANG | A comparison of techniques for the estimation of model prediction uncertainty | uncertainty
21 | | Verification, validation, and confirmation of numerical models in the earth sciences | model evaluation and calibration
22 | ALBRECHT SCHALL | Hierarchical Clustering via Joint Between-Within Distances: Extending Ward's Minimum Variance Method | clustering
23 | | Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting | smoothing
24 | JASPER DENISSEN | Toward the true near‐surface wind speed: Error modeling and calibration using triple collocation | uncertainty
 
 

2.  Feedback

Your feedback is valuable because it helps the instructors and organizers to improve the individual modules and the general structure of the workshop.
The survey results are available here. Statistics and statements should not be taken as an exhaustive or exclusive list.
