Welcome to CSI 771

Computational Statistics

Fall, 2002

Instructor: James Gentle

Instructor's email: jgentle@gmu.edu

Class meets on Wednesdays from 4:30pm to 7:10pm.

Final exam is December 11 from 4:30pm to 7:15pm.

This Web page will evolve as the semester progresses.


This course is about modern, computationally intensive methods in statistics. It emphasizes the role of computation as a fundamental tool of discovery in statistical analysis.

Topics to be covered include

  • Monte Carlo studies in statistics
  • Data partitioning and resampling
  • Nonparametric probability density estimation
  • Identification of structure in data
  • Statistical models and data fitting

    Prerequisites for this course include a course in applied statistics and a course in statistical inference.


    The text for the course is Elements of Computational Statistics.

    There are some(!) errata and other notes. Let me know of any other errors (including minor typos) that you find.

    You should look over the notation descriptions and definitions in Appendix C beginning on page 363.

    Student work in the course (and the relative weighting of this work in the overall grade) will consist of

  • a number of small assignments, problems, etc. (15)
  • a semester project to replicate and extend a published Monte Carlo study (30)
  • an in-class midterm (25)
  • a final exam consisting of an in-class component and a take-home component (30)

    You must have an account on a system that has a web server. The CSI system is scs.gmu.edu. There are several other possibilities, including the university systems mason.gmu.edu and osf1.gmu.edu, and systems in IT&E. If you do not have an account yet, you can get one on scs.gmu.edu by filling out a request form that you can get from the SCS office in 103 Science & Technology I.

    The scs.gmu.edu system requires a secure login (ssh) and secure ftp. You can get information about the system and options for accessing it at www.scs.gmu.edu/computing/

    Here's a source of utility freeware, including programs for ssh.

    Here's info on getting an account on the main GMU computers.

    Each student will prepare a Web page for presentation of the project and for some of the smaller assignments.
    Here's more info on making a webpage, especially on GMU computers.

    There are several programs that help you write html. I do not use any of these but you may find them useful. You can also produce html output directly from Microsoft Word. I do not use that for html either. (In fact, I use Word as infrequently as possible.)

    The main software used in the course will be S-Plus or R.
    A student version of S-Plus can be obtained at http://elms03.e-academy.com/splus/
    Information about R, including links for downloading, can be obtained at http://www.r-project.org/



    Schedule


    August 28

    Course overview; method of communication; Computer organization: Unix and basic tools; S-Plus, R.
    Monte Carlo studies.
    Random number generation in S-Plus and R.

    Assignment: Read Section 2.1 (pages 39-53) and Appendices A and B (pages 337-362).
    Make a web page for your project. Choose two articles in the statistics literature that report Monte Carlo studies and write brief descriptions of them on your web page.

    Two examples from the March 2002 issue of the Journal of the American Statistical Association are the one by Hawkins and Olive on problems of resampling for robust regression estimators and the one by Qin, Leung, and Shao on estimation with nonignorable nonresponse. There are several more articles in that issue that use Monte Carlo simulation to study statistical methods.


    September 4

    Discussion of Monte Carlo studies; Student presentations of descriptions of articles (first project milestone).
    Discussion of methods of statistical inference (preview of Chapter 1).

    Assignment: Read Chapter 1; work problems 1.2, 1.3, 1.7, and 1.9 to turn in.
    This can be turned in late -- the assignment was a little harder than I thought, and also some people have had problems getting their programming skills up to par.
    Put a brief description of your project on your web page. You will add to this description as the semester progresses.


    September 11

    Discussion of projects if necessary (second project milestone)
    Discussion of some material from Chapter 1 on least squares, and of methods of optimization.
    Random number generation.

    Assignment: Read Chapter 2; work problems 2.2, 2.4, and 2.7 to turn in.
    This can be turned in next week with the other Chapter 2 problems.


    September 18

    Brief general discussion of projects (third project milestone)
    Review acceptance/rejection. Markov chain methods.

    Assignment: Work problems 2.8, 2.9, and 2.10 to turn in. Read Chapter 3.


    September 25

    Brief student presentations of the third project milestone.
    Inference using Monte Carlo: Monte Carlo tests, and "parametric bootstrap".
    Randomization and data partitioning.
    Bootstrap methods.
    Assignment: Read Chapter 4; work problems 3.6, 4.1, 4.5, 4.9
    Comments on Exercise 4.5

    October 2

    Measures of similarity.
    Transformations.
    Assignment: Read Chapter 5; work problems 5.1, 5.2, 5.5, 5.8

    October 9

    Review

    October 16

    In-class midterm exam. Covers material from Chapters 1-5.
    This will be open book and open notes.

    October 23

    Estimation of functions.
    Student preliminary presentations (fourth project milestone).
    Assignment: Read Chapter 6; work problems 6.6, 6.7, 6.9, 6.10 (due Nov 6).

    October 30

    Nonparametric estimation of probability density functions.

    November 6

    Nonparametric estimation of probability density functions.
    The equation at the top of page 221 should be
    k_{rs} = r/(2B(s+1,1/r))
    You get this by making the change of variable x = t^r, integrating over [0,1], and then doubling the integral.
    A more natural way to write the value is with B(1/r,s+1), because that is the form of the beta integral that appears directly after the change of variable. Of course, B(1/r,s+1) = B(s+1,1/r), so it doesn't matter.
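    The change of variable can be written out step by step. This is only a sketch, and it assumes that the kernel on page 221 has the form K_{rs}(t) = k_{rs}(1-|t|^r)^s on [-1,1]; that form is not stated on this page, so check it against the text. Requiring the kernel to integrate to 1 gives:

```latex
\begin{align*}
1 &= \int_{-1}^{1} k_{rs}\,\bigl(1-|t|^{r}\bigr)^{s}\,dt
   = 2\,k_{rs}\int_{0}^{1} \bigl(1-t^{r}\bigr)^{s}\,dt
   && \text{(symmetry about 0)} \\
  &= \frac{2\,k_{rs}}{r}\int_{0}^{1} x^{1/r-1}\,(1-x)^{s}\,dx
   && \text{($x=t^{r}$, so $dt=\tfrac{1}{r}\,x^{1/r-1}\,dx$)} \\
  &= \frac{2\,k_{rs}}{r}\,B(1/r,\,s+1),
\end{align*}
so that
\[
  k_{rs} = \frac{r}{2\,B(1/r,\,s+1)} .
\]
```

    As a check under the same assumption, r = 2 and s = 1 give B(1/2,2) = 4/3 and hence k_{rs} = 3/4, the familiar constant of the Epanechnikov kernel (3/4)(1-t^2).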
    Assignment: Read Chapter 9; work problems 9.3, 9.5, 9.10, 9.11 (due Nov 13).
    Comments on Exercises
    Review the project report of a fellow student (assignments made in class). You may need to contact the student to ask questions about what was done.
    Prepare a report (approximately 2 pages) in which you describe the problem, identify the factors in the Monte Carlo experiment that was (will be) conducted, identify exactly what the student's extension is, and make general recommendations about how the student might improve the project or the presentation.
    Due November 20.

    November 13

    Clustering and classification.
    Assignment: Read Chapter 10; work problems 10.1, 10.5, 10.7, 10.12 (due Dec 4).

    November 20

    Project milestone 5.
    Clustering and classification.
    Models of dependencies.
    Assignment: Read Chapter 11; work problem 11.1 (due Dec 4).

    November 27

    No class (Thanksgiving recess).

    December 4

    Student presentations of projects (final project milestone).
    Hand out the take-home portion of the final exam.

    December 11

    In-class final exam.
    This will be closed book and closed notes.

    Computational Resources

    Labs with Unix workstations are available for use in this class in both SCS and IT&E.
  • SCS facilities.
  • Software available in SITE labs.

    Other Resources

  • S (or S-Plus) Cheatsheet (courtesy of Barry Brown, University of Texas at Houston)

    The most important WWW repository of statistical stuff (datasets, programs, general information, connection to other sites, etc.) is StatLib Index at Carnegie Mellon.

    Students

    The students in the class all have webpages on which they put parts of their assignments and other interesting stuff.