Final Project Description

Where to find data?
Grading
Examples for inspiration
Important Dates
About project proposal
About exploratory analysis
About blog posts
About the analysis document
A bunch of interesting data sets available online

Final Project
Math 488P/575A: Principles of Data Science

Your final project is to do a novel data analysis that answers a question, and to write about it. This can be interpreted broadly; the requirements are discussed below. The rough outline of the project is:

- Start with a question.
- Find data that might get at that question.
- Play around with the data.
- Attempt to answer the question.
- Iterate.
- Communicate.

Your project should have one significant aspect to it. Examples might include:

- put together a novel data set (e.g. scrape something from the web)
- answer an interesting question
- a “sophisticated” statistical/machine learning model
- a really compelling visualization

You can work solo or in groups of up to 3 people. I can generate an initial non-binding group assignment. You can take my recommendation or ignore it entirely and find your own teammates. See below for grading details and the group work policies.

Final deliverable

There are two final deliverables: a blog post and the analysis document. The final project is due Tuesday July 6th at 11:59pm.

Blog post

Write a blog post in R Markdown aimed at a general audience (think 538 (http://fivethirtyeight.com/)). The post:

- should be 1000-1500 words
- should have at least two figures

See the section “About blog posts” below.

Analysis document

All analysis documents should be posted and well documented. The main technical results (plots, regressions, etc.) should be written up in a well documented, supporting technical document (using R Markdown). You might also include R scripts for cleaning data or helper functions. See the section “About the analysis document” below.

Where to find data?

You can find a seriously large amount of data online. I encourage you to “gather your own data online” by doing something like scraping Twitter (http://varianceexplained.org/r/trump-tweets/), though this is not expected. There are some obvious places to look, like data.gov. I’ve put together a collection of interesting data sets you can find online at the bottom of this page. If you are already doing research with a data set you are welcome to use it, but you have to do something new.

Grading

Your team’s grade will be 50% blog post and 50% analysis document. Your individual grade will be weighted by your team members’ reviews. The project will be graded on:

Communication
- Clear writing (both in the blog post and in the supporting technical document)
- Documented code

Accuracy
- Did you use reasonable statistics?
- Does your final code run?
- How well do your findings support your conclusions? Note that “The evidence is inconclusive” is a very possible, and completely acceptable, answer.

Ambition
- The project should take some creativity and effort, i.e. it should be more than a matter of copy/pasting code.

Groupwork
- You will anonymously rate your team members and yourself on team citizenship (e.g. attends meetings, does what they promise, etc.), not on ability.
- Final grades will be adjusted based on peer ratings. Individual grades are based on the project grade and a multiplier computed from the peer ratings. This multiplier will range from 1.05 (for people who go above and beyond) down to 0 (for people who don’t participate).
- As a last resort you may fire a team member who refuses to participate. Please contact the instructor well before it comes to this. If you are fired you must start a new project and your peer rating multiplier will take a hit.

Examples for inspiration

These are some examples of interesting analyses. Many of these examples would take longer than you have for the final project. They are meant to be inspirations, not expectations.

- Blog posts from polygraph (http://pudding.cool/)
- David Robinson’s text analysis of Trump tweets (http://varianceexplained.org/r/trump-tweets/)
- genre classification (http://josh-jacobson.github.io/genre-classification/)
- 538 on how baby boomers get high (http://fivethirtyeight.com/datalab/how-baby-boomers-get-high/)
- 538 on Bob Ross (http://fivethirtyeight.com/features/a-statistical-analysis-of-the-work-of-bob-ross/)
- see this page (http://d1b10bmlvqabco.cloudfront.net/attach/icf0cypdc3243c/hcwsitww5k95ka/ii7mfqhc946l/CS1 for links to the final projects from CS109 (http://cs109.github.io/2015/pages/projects.html) (warning: a couple of links are broken).

Important Dates

Initial project proposal: due 6/23 at 11:59pm
- Describe your proposed project
- Who is on your team?
- What question(s) will you try to answer?
- What data sets will you use? You should have already found and taken a first look at the data set
- How will you use the data to try to answer the question?

Project proposals should be submitted as Piazza questions for all other students to see. I will comment on these proposals. Note that these comments are meant to help you refine your goals. You are not obligated to complete every task you promised in the proposal.

Exploratory analysis: due 6/30 at 11:59pm
- Write up your initial findings in an R Markdown document. You should have at least N plots (still deciding N, but N will be greater than 3).

Analysis document: due 7/6 at 11:59pm
- Write up your technical results in an R Markdown document. Provide detailed comments so that it is clear to me what you have done.
- Put all code, data, etc. together.

Blog post: due 7/6 at 11:59pm
- should be 1000-1500 words
- have at least two figures
- target a general audience

About project proposal

Write a project proposal with your team. You should brainstorm a long list of ideas, then narrow it down to a couple that are feasible given your knowledge of R, the time constraints, and the available data. Write the proposal for one of these ideas, but keep a couple of backups in case the original project doesn’t work out for some reason.

The point of this exercise is to think through a reasonable project (and get feedback from the instructor). You will not be held to doing exactly what you say you will do in this proposal; expect to adapt your project as you continue to work on it (just ask Robert Burns (http://en.wiktionary.org/wiki/best_laid_plans) or Mike Tyson (http://articles.sun-sentinel.com/2012-11-09/sports/sfl-mike-tyson-explains-one-of-his-most-famous-quotes-20121109_1_mike-tyson-undisputed-truth-famous-quotes)). The more you put into the proposal, however, the better your life will be 2 weeks from now.
Deliverable

Write a one page proposal posted on Piazza which discusses:

- What questions will you try to answer? List 5-10 possible questions.
- What data sets will you use? You should have already found and taken a first look at the data sets. Make sure the data is clean enough to reasonably use and actually has the information content to answer your questions.
- What are some things you will do with the data to get at your questions? For example, what are some plots you might make?
- Include a list of 3 backup ideas you brainstormed, with a couple of bullet points of detail. Just in case.

Advice

- Meet once very early for an initial brainstorm. Have everyone go off and explore some ideas. Meet again for a final brainstorm. Then write the proposal.
- Look at the data sets you plan on using to make sure they aren’t awful.
- If you plan on creating a data set (e.g. by scraping a website), convince me this will be feasible (you don’t have to have the scraper working perfectly).

About exploratory analysis

By this point you should have done an exploratory analysis and have initial results. What this means will vary from project to project, so there aren’t many formal requirements. The point of this is to:

- take stock of where you are,
- show me that you have made good progress, and
- convince me you will be able to finish the project.

Basically we expect to see that you:

- have the data
- asked/answered a bunch of questions by making lots of plots and computing statistics
- narrowed down the scope of the project to something coherent and manageable
- have some initial results

What “initial results” means will also vary from project to project. For example, if the project is to build a model to predict Y based on X, then you should have looked at a few simple models.

Deliverables

Gather everything into one folder called n_eda (where n = your group number, which I will assign to your group).
This folder should have four subfolders: /summary, /initial_results, /everything, and /data. Please zip the n_eda folder and submit it to the Google Form that I will set up.

1. Write a summary of what you have tried, what you found and what you have left to do. This document should be about a page and can be mostly bullet points. Put this document into a folder called summary.
2. Have some form of initial results. This could be a .Rmd document with a couple of plots. The initial results should be short and to the point. Put the initial results into a folder called initial_results.
3. Include the rest of the work you have done. Simply gather all the scripts/.Rmd files you have so far from each team member and put them into one folder (called everything). This is just so I can see all the work you have done.
4. Include the data.

7/1/2021 Final Project Description - Summer 2021 Principles of Data Science (MATH-575A-01, MATH-488P-01, MATH-488P-02, MATH-590S-01) http://brightspace.binghamton.edu/d2l/le/content/53741/viewContent/60761/View

About blog posts

Write a blog post explaining what you found. It should answer:

1. What is the question(s) you tried to answer? Why should someone care?
2. What is the data/how did you get it?
3. How did you answer the questions (e.g. what statistical techniques, etc.)?
4. What are your findings?

Points 1 and 4 are the most important for the blog post (your analysis document focuses on 2 and 3). This blog post should be aimed at a general audience who is not afraid of graphs/a little data (think 538 (http://fivethirtyeight.com/)). The vast majority of the technical details should be in the analysis document.

Requirements for the blog post

- The post should be 1000-1500 words.
- Include a title and your names.
- Don’t display R code unless it is used to convey a point.
- There should be at least 2 visualizations. Make sure to describe the figures somewhere in the text. These plots should be communicatory plots, not exploratory plots.
- The post should be submitted in .html (probably written in R Markdown).

Submission

Include everything that went into creating this blog post in a folder called n_blog (where n = your group number). You can name the blog post whatever you want, just make sure it is a .html document. Please compress n_blog and submit it to the Google Form that I will set up.

I plan on posting these blog posts and your analyses on the internet. If you do not want your name associated with the post (or if you don’t want even an anonymous version of the post displayed to the outside world on the internet), please let me know as soon as possible.

Grading

Communication (80%)
- Does your main point come through (e.g. see here (http://www.storytellingwithdata.com/blog/2017/3/22/so-what))?
- Is the document written well and clearly? Yes, spelling and grammar matter.
- Quality of the figures.
- Effective communication? Ask your parents or friends to read your post and give you feedback.

Accuracy (10%)
- Do you accurately convey a rigorous argument?

Ambition (10%)

Bonus points

Your team will get up to 5 extra points on the final project grade if you do the following:

- make a webpage using github pages (see here (http://pages.github.com/))

Github pages are very easy to make. The webpage should show off all aspects of your project, including the blog post and technical analysis. The better this page is, the more points you will get.

About the analysis document

Using R Markdown, write a document called process_notebook describing the process you used to conduct your analysis (note this description is borrowed from here (http://cs109.github.io/2015/pages/projects.html)). The process_notebook is the core document for the analysis.
It should show the code for the entire analysis you did and include text justifying the decisions you made (e.g. why did you remove certain observations, why median instead of mean, how did you select the variables for a model, etc.). The target audience is someone who knows R/statistics but is unfamiliar with your project (i.e. the graders, or even yourself three months from now).

The process_notebook should detail the steps you took to develop a solution. This includes where you got the data, other solutions you tried, the statistical methods you chose and your findings. How you got to your conclusions is as important as the conclusions. This is where you can show all the work you put into this project. You should have lots of visualizations in the notebook.

Your discussion should hit on the following topics (depending on the project, some of these will be more important than others):

- Abstract: one paragraph at the very beginning of the document summarizing everything.
- Overview and Motivation: Provide an overview of the project goals and the motivation for it. Consider that this will be read by people who did not see your project proposal.
- Related Work: Anything that inspired you, such as a paper, a web site, or something we discussed in class.
- Initial Questions: What questions are you trying to answer? How did these questions evolve over the course of the project? What new questions did you consider in the course of your analysis?
- Data: Source, scraping method, cleanup, storage, etc.
- Exploratory Data Analysis: What visualizations did you use to look at your data in different ways? What are the different statistical methods you considered? Justify the decisions you made, and show any major changes to your ideas. How did you reach these conclusions?
- Final Analysis: What did you learn about the data? How did you answer the questions? How can you justify your answers? Make sure the reader can answer the question “What is the point?” (e.g.
see here (http://www.storytellingwithdata.com/blog/2017/3/22/so-what)).

Submission

Gather everything into a folder called n_analysis (where n = your group number). This folder should have three sub-folders: /data, /results, /everything_else. Compress n_analysis and submit it to the Google Form.

1. /results: The /results folder should have an R Markdown document called process_notebook (include both the .Rmd and .html documents) and possibly several supporting .R scripts for helper functions you wrote. If you write helper functions (recommended) you should include them in separate .R scripts. The .Rmd document should assume the working directory is the n_analysis folder and should load the data accordingly (e.g. read_csv('data/my_cool_dataset.csv')). We may knit the process_notebook.Rmd and it should run! The process_notebook should be mostly a matter of copy/pasting your analysis into a .Rmd document and then adding discussion (discussion should be in text, not in comments).
2. /data: Put the data sets you used in this folder. If you started with a messy data set and did significant processing then you should include both the raw and the cleaned data sets in separate sub-folders, i.e. /data/raw/ and /data/clean/.
3. /everything_else: You probably did a lot of stuff that didn’t make it into your final analysis. Include anything you did that you want to get credit for in this folder. If you have a lot of material in here that you want us to look at, then you should include a text document in this folder pointing us to it.
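
The folder layout and relative-path loading described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the group number 7, the toy data set, and the file name helpers.R are all made up, the folder is built inside a temporary directory, and base R's read.csv stands in for read_csv so that the example runs without any packages.

```r
# Hypothetical group number; replace it with the number assigned to your group.
group_number <- 7
root <- file.path(tempdir(), paste0(group_number, "_analysis"))

# Create the three sub-folders from the submission instructions,
# plus the optional raw/clean split under data/.
sub_dirs <- c("data/raw", "data/clean", "results", "everything_else")
for (d in sub_dirs) {
  dir.create(file.path(root, d), recursive = TRUE, showWarnings = FALSE)
}

# Write a tiny made-up data set so the example runs end to end.
toy <- data.frame(x = 1:5, y = c(2, 4, 6, 8, 10))
write.csv(toy, file.path(root, "data", "clean", "toy.csv"), row.names = FALSE)

# Keep helper functions in their own .R script (the file name is hypothetical).
writeLines(
  c("# Center a numeric vector at its median.",
    "center_median <- function(v) v - median(v)"),
  file.path(root, "results", "helpers.R")
)

# The notebook is assumed to run with n_analysis as the working directory,
# so every path below is relative to that folder.
old_wd <- setwd(root)
source("results/helpers.R")
dat <- read.csv("data/clean/toy.csv")
dat$y_centered <- center_median(dat$y)
setwd(old_wd)

print(dat)
```

One thing to watch for: by default knitr evaluates chunks relative to the folder that contains the .Rmd file, which here would be /results rather than n_analysis. One possible way to make the relative paths resolve is to knit from inside n_analysis with rmarkdown::render("results/process_notebook.Rmd", knit_root_dir = getwd()); treat that as one option, not a requirement.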