Final project

Due by 11:59 PM on Wednesday, December 14, 2022

One member of each group should turn in the group’s work on D2L. The turned-in copy should have the group member’s names at the top.

Turn in your copies by 11:59 on the date listed on the Schedule

Requirements

Data analytics is inherently a hands-on endeavor. Accordingly, the final project for this class is hands-on. As per the overview page, the final project has the following elements:

  1. For your final project in this class, you will analyze existing data in some area of interest to you. You will aggregate (merge) datasets, “find a pattern in the data that cannot be expressed in a closed form”, and communicate your findings in a memo.1
  1. You must visualize (at least) three interesting features of that data. Visualizations should aid the reader in understanding something about the data that might not be readily aparent.[^4]

  2. You must come up with some analysis—using tools from the course—which relates your data to either a prediction or a policy conclusion. For example, if you collected data from Major League Baseball games, you could try to “predict” whether a left-hander was pitching based solely on the outcomes of the batsmen. If you assembled data from COVID lockdowns and unemployment, you could test hypotheses about the labor market and its reaction to COVID.

If you choose a prediction project, you should employ all of the tools we have learned regarding training and testing, and report the relevant measures of fit.

If you choose a policy conclusion project, you should explore the relationship you’re studying. If you find that low-income workers were most likely to be laid off during COVID, you should examine if this corresponds to, say, industry of employment (e.g. “service jobs were worst hit, and are most likely the job held by low-income workers”).

  1. You will submit two things via D2L:
  • A PDF of your report (see the outline below for details of what this needs to contain) rendered from your R Markdown using LaTeX as in our labs. You might want to write the prose-heavy sections in a word processor like Word or Google Docs and copy/paste the text into your R Markdown document, since RStudio doesn’t have a nice spell checker or grammar checker. This should have no visible R code, warnings, or messages in it anywhere. To do this, you must set echo = FALSE in the code chunk options knitr::opts_chunk$set(echo = FALSE, ...) at the beginning of your document template before you knit, and review your text carefully before turning it in.

  • The same PDF as above, but with all the R code in it (set echo = TRUE at the beginning of your document and reknit the file). Please label files in an obvious way. I will usually grade from this document.

This project is due by 11:59 PM on the date listed on the Schedule No late work will be accepted. For real. MSU has grading deadlines and I’ve given you every second that can be spared.

There is no final exam. This project is your final exam.

The project will not be graded using a check system, and will be graded by me (the main instructor, not a TA). I will evaluate the following four elements of your project:

  1. Technical skills: Was the project easy? Does it showcase mastery of R for data analysis? (20%)

  2. Visual design: Was the information smartly conveyed and usable? Was it beautiful? (25%)

  3. Analytic design: Was the analysis appropriate? Was it sensible, given the dataset? Did you miss any glaring issues? (20%)

  4. Story: Did we learn something? Did you explore your findings sufficiently deeply? (25%)

  5. Following instructions: Did you surpress R code as asked? (10%)

If you’ve engaged with the course content and completed the exercises and mini projects throughout the course, you should do just fine with the final project.

Teams

My team sucks; how can I punish them for their lack of effort?

On this front, we will be more supportive. While you have to put up with your team regardless of their quality, you can indicate that your team members are not carrying their fair share by issuing a strike. This processs works as follows: 1. A team member systematically fails to exert effort on collaborative projects (for example, by not showing up for meetings or not communicating, or by simply leeching off others without contributing.) 2. Your frustration reaches a boiling point. You decide this has to stop. You decide to issue a strike 3. You send an email with the following information: - Subject line: [SSC442] Strike against [Last name of Recipient] - Body: You do not need to provide detailed reasoning. However, you must discuss the actions (plural) you took to remedy the situation before sending the strike email.

A strike is a serious matter, and will reduce that team member’s grade on joint work by 10%. If any team-member gets strikes from all other members of his or her team, their grade will be reduced by 50%.

Strikes are anonymous so that you do not need to fear social retaliation. However, they are not anonymous to allow you to issue them without thoughtful consideration. Perhaps the other person has a serious issue that is preventing them from completing work (e.g., a relative passing away). Please be thoughtful in using this remedy and consider it a last resort.

I’m on a smaller-than-normal team. Does this mean that I have to do more work?

Your instructors are able to count and are aware the teams are imbalanced. Evaluations of final projects will take this into account. While your final product should reflect the best ability of your team, we do not anticipate that the uneven teams will lead to substantively different outputs.

Suggested outline

You must write and present your analysis as if presenting to a C-suite executive. If you are not familiar with this terminology, the C-suite includes, e.g., the CEO, CFO, and COO of a given company. Generally speaking, such executives are not particularly analytically oriented, and therefore your explanations need to be clear, consise (their time is valuable) and contain actionable (or valuable) information.2 The report should be no more than 3 pages (excluding figures and tables). - Concretely, this requires a written memo, which describes the data, analyses, and results. This must be clear and easy to understand for a non-expert in your field. Figures and tables do not apply to the page limit. You must use headers (# at the beginning of a line for sections, ## for subsections, etc.) for each of these sections. Make sure you inspect your final product before you upload it. - As in Project 1, do not narrarate your process. “We tried using summarize() and then found that…” has no business in a C-suite memo. Nor does “we looked at X then decided that Y was…”. Tell the reader what data you used and what you found, and not about the process.

Below is a very loose guide to the sort of content that we expect for the final project.

Introduction

Describe the motivation for this analysis. Briefly describe the dataset(s), and explain why the analysis you’re undertaking matters for society. (Or matters for some decision-making. You should not feel constrained to asking only “big questions.” The best projects will be narrow-scope but well-defined.) (≈300 words)

Theory and Background

Provide in-depth background about the data of interest and about your analytics question. (≈300 words)

Theory

Provide some theoretical guidance to the functional relationship you hope to explore. If you’re interested on how, say, height affects scoring in the NBA, write down a proposed function that might map height to scoring. Describe how you might look for this unknown relationship in the data, but do not describe the data yet. (≈300 words)

Hypotheses

Make predictions to form the questions that you will study. Declare what you think will happen (the theory section will explain the “why”). You should be able to reject or fail to reject these hypotheses based on your analyses. (≈250 words)

Data and Analyses

Data

Given your motivations, limits on feasibility, and hypotheses, describe the data you use. (≈100 words)

Analyses

Generate the analyses relevant to your hypotheses and interests. Here you must include three figures and must describe what they contain in simple, easy to digest language. Why did you visualize these elements? Your analyses also must include brief discussion.

Conclusion

What caveats should we consider? Do you believe this is a truly causal relationship? Or, if a predictive project, how well did you do and should we bet money / diagnose / rely on your prediction? Why does any of this matter to the decision-maker? (apx 75 words)


  1. Note that existing is taken to mean that you are not permitted to collect data by interacting with other people. That is not to say that you cannot gather data that previously has not been gathered into a single place—this sort of exercise is necessary. But you cannot stand with a clipboard outside a store and count visitors (for instance). That would require involving the Institutional Review Board (IRB) and nobody wants to do that.↩︎

  2. This exercise provides you with an opportunity to identify your marketable skills and to practice them. I encourage those who will be looking for jobs soon to take this exercise seriously.↩︎