19 Free Public Data Sets For Your First Data Science Project

sciping
Published June 23, 2018, 12:32:20

Completing your first project is a major milestone on the road to becoming a data scientist. It’s also an intimidating process. The first step is to find an appropriate, interesting dataset. You should decide how large and how messy a dataset you want to work with. Cleaning data is an integral part of data science, but for a first project you may want to start with a clean dataset so that you can focus on the analysis rather than on the cleanup.
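One rough way to judge how large and how messy a candidate dataset is before committing to it is to count its rows and the share of empty cells in each column. The sketch below uses only the Python standard library. The helper name `profile_csv` and the sample rows are made up for illustration and do not come from any dataset in this article.

```python
import csv
import io

# Made-up sample standing in for a downloaded CSV file; not real data.
SAMPLE_CSV = """state,population,median_income
A,1000,52000
B,,48000
C,2500,
D,4000,61000
"""

def profile_csv(text):
    """Return (row_count, {column: share of empty cells}) for CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    columns = reader.fieldnames or []
    missing = {name: 0 for name in columns}
    rows = 0
    for row in reader:
        rows += 1
        for name in columns:
            value = row.get(name)
            # Treat absent or whitespace-only cells as missing.
            if value is None or value.strip() == "":
                missing[name] += 1
    shares = {name: (count / rows if rows else 0.0)
              for name, count in missing.items()}
    return rows, shares

row_count, missing_share = profile_csv(SAMPLE_CSV)
print(row_count, missing_share)
```

A column with a large share of empty cells is a sign that the dataset will need cleaning before any analysis, which may or may not be what you want from a first project.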

Based on what we’ve learned from our Introduction to Data Science Course and the Data Science Career Track, we’ve selected datasets of varying types and complexity that we think work well for first projects (some of them work for research projects as well!). These datasets cover a variety of sources: demographic data, economic data, text data, and corporate data.

    1. United States Census Data: The United States Census Bureau publishes reams of demographic data at the state, city, and even zip code level. The dataset is fantastic for creating geographic data visualizations and can be accessed on the Census website. Alternatively, the data can be accessed via an API. One convenient way to use that API is through the choroplethr package for R. In general, this data is very clean and very comprehensive.
    2. FBI Crime Data: The FBI crime dataset is fascinating. If you’re interested in analyzing time series data, you can use it to chart changes in crime rates at the national level over a 20-year period. Alternatively, you can look at the data geographically.
    3. CDC Cause of Death: The Centers for Disease Control and Prevention (CDC) maintains a database on causes of death. The data can be segmented in almost every way imaginable: age, race, year, and so on.
    4. Medicare Hospital Quality: Medicare maintains a database on complication rates by hospital that provides for interesting comparisons.
    5. SEER Cancer Incidence: The US government also has data about cancer incidence, again segmented by age, race, gender, year, and other factors.
    6. Bureau of Labor Statistics: Many important economic indicators for the United States (like unemployment and inflation) can be found on the Bureau of Labor Statistics website. Most of the data can be segmented both by time and by geography.
    7. The Bureau of Economic Analysis: The Bureau of Economic Analysis also has national and regional economic data, like GDP and exchange rates.
    8. IMF Economic Data: If you want a view of international data, you can find it on the IMF website.
    9. Dow Jones Weekly Returns: Predicting stock prices is a major application of data analysis and machine learning. One dataset to explore is the weekly returns of the Dow Jones Index.
    10. Boston Housing Data: The Boston Housing Data Set contains median housing prices in Boston suburbs as well as 13 attributes that contribute to those prices. It’s an excellent set for experimenting with various types of regressions.
    11. Enron Emails: After the collapse of Enron, a dataset of roughly 500,000 emails, with message text and metadata, was released. The dataset is now famous and provides an excellent testing ground for text-related analysis. It has the messiness of real-world data.
    12. Google N-Grams: If you’re interested in truly massive data, the Google n-grams dataset counts the frequency of words and phrases by year across a huge number of text sources. The resulting file is 2.2 TB.
    13. Sentence Sentiments: Researchers have labeled 3,000 sentences as expressing positive or negative sentiments. If you’re interested in classifying text, this is a great place to start.
    14. Reddit Comments: Reddit released a dataset of every comment that has ever been made on the site. That’s over a terabyte of data uncompressed, so if you want a smaller dataset to work with, Kaggle hosts the comments from May 2015 on its site.
    15. Wikipedia: Wikipedia provides instructions for downloading the text of English-language articles.
    16. Lending Club: Lending Club provides data about loan applications it has rejected as well as the performance of loans that it issued. The dataset lends itself both to classification techniques (will a given loan default?) and to regression (how much will be paid back on a given loan?).
    17. Walmart: Walmart has released store-level sales data for 98 items across 45 stores. This is an excellent dataset for time series analysis and has interesting seasonal components as well.
    18. Airbnb: This website offers datasets of Airbnb listings for a number of different cities.
    19. Yelp: Yelp releases an academic dataset that contains information for the areas around 30 universities.
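Several of the datasets above, such as the Boston Housing data, are suggested for experimenting with regression. As a minimal sketch of what that means, the following fits a straight line with one predictor by ordinary least squares, using only plain Python. The function name `fit_line` and the numbers are invented for illustration. They are not taken from the Boston Housing data, which has 13 attributes rather than one.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up numbers for illustration only; not taken from any real dataset.
rooms = [4, 5, 6, 7, 8]
price = [15, 20, 24, 31, 35]

slope, intercept = fit_line(rooms, price)
predicted = intercept + slope * 6.5  # prediction for a hypothetical 6.5 rooms
print(round(slope, 2), round(intercept, 2), round(predicted, 2))
```

In a real project you would more likely reach for a statistics library and fit several attributes at once, but writing the one-variable case by hand is a reasonable way to check that you understand what the library is computing.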
  • When reposting, please retain the link to this article: https://www.sciping.com/9233.html