19 Free Public Data Sets For Your First Data Science Project

Published: June 23, 2018, 12:32:20

Completing your first project is a major milestone on the road to becoming a data scientist. It’s also an intimidating process. The first step is to find an appropriate, interesting dataset. You should decide how large and how messy a dataset you want to work with; while cleaning data is an integral part of data science, you may want to start with a clean dataset for your first project so that you can focus on the analysis rather than on cleaning the data.

Based on what we have learned from our Introduction to Data Science Course and the Data Science Career Track, we’ve selected datasets of varying types and complexity that we think work well for first projects (some of them work for research projects as well!). These datasets cover a variety of sources: demographic data, economic data, text data, and corporate data.

    1. United States Census Data: The United States Census publishes reams of demographic data at the state, city, and even zip code level. The dataset is fantastic for creating geographic data visualizations and can be accessed on the Census website. Alternatively, the data can be accessed via an API. One convenient way to use that API is through the choroplethr R package. In general, this data is very clean and very comprehensive.
    2. FBI Crime Data: The FBI crime dataset is fascinating. If you’re interested in analyzing time series data, you can use it to chart changes in crime rates at the national level over a 20-year period. Alternatively, you can look at the data geographically.
    3. CDC Cause of Death: The Centers for Disease Control and Prevention (CDC) maintains a database on cause of death. The data can be segmented in almost every way imaginable: age, race, year, and so on.
    4. Medicare Hospital Quality: Medicare maintains a database on complication rates by hospital, which allows for interesting comparisons between hospitals.
    5. SEER Cancer Incidence: The US government also has data about cancer incidence, again segmented by age, race, gender, year, and other factors.
    6. Bureau of Labor Statistics: Many important economic indicators for the United States (like unemployment and inflation) can be found on the Bureau of Labor Statistics website. Most of the data can be segmented both by time and by geography.
    7. The Bureau of Economic Analysis: The Bureau of Economic Analysis also has national and regional economic data, like GDP and exchange rates.
    8. IMF Economic Data: If you want a view of international data, you can find it on the IMF website.
    9. Dow Jones Weekly Returns: Predicting stock prices is a major application of data analysis and machine learning. One dataset to explore is the weekly returns of the Dow Jones Index.
    10. Boston Housing Data: The Boston Housing Data Set contains median housing prices in Boston suburbs as well as 13 attributes that contribute to those prices. It’s an excellent set for experimenting with various types of regressions.
    11. Enron Emails: After the collapse of Enron, a dataset of roughly 500,000 emails, with message text and metadata, was released. The dataset is now famous and provides an excellent testing ground for text-related analysis. It has the messiness of real-world data.
    12. Google N-Grams: If you’re interested in truly massive data, the Google n-grams dataset counts the frequency of words and phrases by year across a huge number of text sources. The resulting file is 2.2 TB.
    13. Sentence Sentiments: Researchers have labeled 3,000 sentences as expressing positive or negative sentiments. If you’re interested in classifying text, this is a great place to start.
    14. Reddit Comments: Reddit released a dataset of every comment that has ever been made on the site. That’s over a terabyte of data uncompressed, so if you want a smaller dataset to work with, Kaggle has hosted the comments from May 2015 on their site.
    15. Wikipedia: Wikipedia provides instructions for downloading the text of English-language articles.
    16. Lending Club: Lending Club provides data about loan applications it has rejected as well as the performance of loans that it issued. The dataset lends itself both to classification techniques (will a given loan default?) and to regressions (how much will be paid back on a given loan?).
    17. Walmart: Walmart has released store-level sales data for 98 items across 45 stores. This is an excellent dataset for time series analysis and has interesting seasonal components as well.
    18. Airbnb: This website offers a number of datasets of Airbnb listings, organized by city.
    19. Yelp: Yelp releases an academic dataset that contains information for the areas around 30 universities.
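
Whichever dataset you pick from the list, it helps to check early how large and how messy it is, since that determines how much cleaning stands between you and the analysis. The following is a minimal sketch of such a check in Python, using only the standard library. The sample rows, the column names, the `profile_csv` helper, and the set of strings treated as "missing" are all hypothetical and chosen for illustration; none of them come from the datasets above.

```python
import csv
import io

# Tiny made-up sample standing in for a downloaded file; not real data.
SAMPLE_CSV = """state,year,population,median_income
A,2010,1000,50000
B,2010,,42000
C,2010,2500,NA
D,2010,1800,
"""

# Strings treated as "missing". This set is an assumption; real sources
# use different markers, so adjust it after looking at the raw file.
MISSING_MARKERS = {"", "NA", "N/A", "null", "NaN"}


def profile_csv(text):
    """Return (row_count, {column: fraction of missing values})."""
    reader = csv.DictReader(io.StringIO(text))
    columns = reader.fieldnames or []
    missing = {name: 0 for name in columns}
    row_count = 0
    for row in reader:
        row_count += 1
        for name in columns:
            value = row.get(name)
            # Short rows yield None; otherwise compare the trimmed text.
            if value is None or value.strip() in MISSING_MARKERS:
                missing[name] += 1
    if row_count == 0:
        return 0, {name: 0.0 for name in columns}
    return row_count, {name: count / row_count for name, count in missing.items()}


rows, missing_share = profile_csv(SAMPLE_CSV)
print(f"rows: {rows}")
for column, share in missing_share.items():
    print(f"{column}: {share:.0%} missing")
```

For a real file you would pass the file's contents instead of `SAMPLE_CSV`. A dataset where most columns show a low share of missing values is probably a reasonable candidate for a first project; a high share suggests that cleaning will take up much of the work.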