Course Review - Dataquest, Data Scientist in Python
This course sequence, which offers foundational knowledge for becoming a Data Scientist, is available at https://www.dataquest.io/. Working through it is a very tall undertaking given the sheer amount of content covered. However, as you go through the content, one thing is made abundantly clear: becoming a Data Scientist is no easy feat, and the Dataquest material is merely an introduction. Below, I will provide a brief review of the course, and then dive into a deeper breakdown of the content covered.
The “Quick and Dirty” Review
This was a great set of courses for gaining basic familiarity with most of the things that a Data Scientist has to do on a daily basis. I highly recommend the course sequence. It has some important and tangible benefits, as well as a few drawbacks, which I will touch on now.
One of the key benefits of the Dataquest material is its broad exposure to the many methodologies and “tools of the trade” that it takes to work with data. Throughout the course material, you work through a large number of challenges and guided projects which are designed to allow you to apply what you are learning. The course sequence and structure feel extremely well organized and are easy to follow. You have a chance to apply just about every concept you learn, which is very important. Furthermore, going through these projects gives you an opportunity to build a portfolio, which is great to showcase to potential employers and will also serve as a reference as you dive into your own projects. Lastly, for those who prefer written content, each course consists of just that (text). There are no videos, so if you’re anything like me, you can sit down for a couple of hours at a time, put some music on, and get yourself into a good flow with the material.
There are also a couple of drawbacks to the style of learning on Dataquest. Just as the text-only style of learning can be a benefit to some, others may find it a bit too dry and would prefer video instruction. After more than a hundred hours of reading instruction and prompts on one side of a screen, and coding on the other, I admit that I started to burn out toward the end and felt like I just wanted to get through the material. At $49 a month for the Premium Subscription (required for the Data Scientist and Data Engineer paths), it’s a little bit hard not to want to wrap the material up at some point. Another drawback is that some topics come up frequently and are used often, while others are short-lived and then go away for the rest of the course sequence. This makes some topics stick, while others fade from memory to make room for the more frequently visited topics. That being said, I imagine this is a challenge with almost any course! Knowing how much material was a part of this “Data Scientist in Python” path, it’s hard to imagine how much longer it would have taken to drill each topic multiple times. I think you just have to understand that going through this “flavor” of material is not a one-and-done thing. For example, I will probably have to go through a couple more courses on Git, and apply it in my own projects, before I consider myself an expert. This material served as a great primer for things I should be thinking about as I gear up for the next Data Science education I decide to take up. Next, I’ll give a more detailed explanation of the course sequence, to give you a good sense of what was covered.
A Thorough Review
The Data Scientist in Python track starts out with a thorough introduction to programming in Python. From basic things like printing your first statement, “Hello, world!”, to working with loops and writing functions, the beginner and intermediate courses on Python are geared toward giving you a strong enough foundation to work in Python. These courses are very well structured, and are consistently being updated and revamped to give you the best instruction possible.
Once you have a good basis in programming with Python, you get your first taste of data analysis and visualization by learning to work with pandas, NumPy, Matplotlib and Seaborn (all Python libraries). Using these packages together allows you to manipulate arrays and dataframes, and then create visualizations that help you to understand patterns in the data that can be used to tell a story. Exploratory data analysis (EDA) and visualization are a pivotal part of becoming a data scientist, as simply looking at summary statistics to draw inferences from data can at times be misleading. The EDA section of the material caps off with a course on data cleaning.
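To make the point about misleading statistics concrete, here is a minimal sketch of my own (not taken from the Dataquest material). It uses only Python's standard library rather than pandas or Matplotlib, and the two samples are made-up numbers chosen so that their mean and standard deviation match exactly, even though their shapes are very different.

```python
from collections import Counter
from statistics import mean, pstdev

# Two made-up samples, chosen so that their summary statistics match exactly.
spread_out = [2, 4, 4, 4, 5, 5, 7, 9]
two_clusters = [3, 3, 3, 3, 7, 7, 7, 7]


def summarize(values):
    """Return (mean, population standard deviation) for a list of numbers."""
    return mean(values), pstdev(values)


def text_histogram(values):
    """Build a crude text histogram, standing in for a real plot."""
    counts = Counter(values)
    return "\n".join(
        f"{value:>2} | {'#' * counts[value]}" for value in sorted(counts)
    )


for name, sample in [("spread_out", spread_out), ("two_clusters", two_clusters)]:
    print(name, summarize(sample))
    print(text_histogram(sample))
```

Both samples report the same mean and standard deviation, but the histograms show that one is spread across several values while the other sits in two separate clusters. In practice you would draw this with Matplotlib or Seaborn rather than with text.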
After gaining a good grasp of how to explore data with some of Python’s most popular libraries for working with data structures and creating visualizations, you are introduced to the command line. Through an introductory and an intermediate course on the command line, and then a course on Git, you gain a good understanding of how to manipulate files and incorporate version control into your data science workflows. Both of these are extremely important elements of working with data, particularly at scale.
Once you have been exposed to navigating the command line and have familiarized yourself with version control, you jump into one of the most crucial topics in any data-related discipline, which is working with data sources. Specifically, you are taken through a number of courses on SQL, from beginner to advanced level. As working with databases remains the predominant way that any data professional will gather the data necessary to do an analysis, this is not a series of courses to miss. Although a good amount of this material was review for me, I still learned some new things and had the opportunity to reinforce concepts. The series of courses centered on data sources culminates with content on APIs and web scraping. This is a very quick introduction, and I got the sense that scraping the web for data successfully would require a much deeper dive and significant practice.
After the courses on working with data sources, another topic that is valuable in data science and research roles is introduced: statistics and probability. Interpreting the results of a regression or an A/B test, and designing experiments, all depend on a data scientist’s ability to understand and leverage foundational statistics concepts. This material was very helpful for me, as in my experience, there is never “enough” mathematical study. Because I haven’t held a role where I regularly had to apply these concepts, the rigor with which this material is presented allowed me to really solidify them.
Now that you’ve developed your statistical acumen, some of the “fun stuff” begins, in the expansive world of Machine Learning. You are taken through the following courses (in order): Machine Learning Fundamentals, Calculus for Machine Learning, Linear Algebra For Machine Learning, Linear Regression For Machine Learning, Machine Learning In Python: Intermediate, Decision Trees, Deep Learning Fundamentals, and lastly, Machine Learning Project. This is A LOT of material, and it will take you at least a few weeks to get through it all. Generally speaking, I think all of this material does a good job of painting the picture of what Machine Learning is and the requisite mathematical concepts, but doesn’t offer deep coverage of any given topic. If you don’t have a strong mathematical background, I would expect that the material in calculus and linear algebra may go over your head. I was lucky to have some remnants of that material from academia to draw on. However, even with my background in this content, I’ll seek out more specifically focused courses on each of these topics to really get them down.
After your foray into the world of Machine Learning, you are exposed to some advanced Python and Computer Science topics. This section contains two courses: one on data structures and algorithms, and the other on advanced Python. Similar to the Machine Learning material outlined above, these two courses leave a bit to be desired in terms of depth. However, getting introduced to topics like recursion, search and sort algorithms, etc. is extremely valuable. You can see my review on CS50 if you are looking to gain a thorough grasp of these concepts.
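As a small taste of how recursion and search algorithms fit together, here is a sketch of a recursive binary search. This is my own illustration rather than code from the course, and the function name, parameters and sample list are made up for the example.

```python
def binary_search(sorted_items, target, low=0, high=None):
    """Return the index of target in sorted_items, or -1 if it is absent.

    Assumes sorted_items is already sorted in ascending order.
    """
    if high is None:
        high = len(sorted_items) - 1
    if low > high:  # base case: the search window is empty
        return -1
    mid = (low + high) // 2
    if sorted_items[mid] == target:
        return mid
    if sorted_items[mid] < target:
        # The target can only be in the upper half.
        return binary_search(sorted_items, target, mid + 1, high)
    # Otherwise the target can only be in the lower half.
    return binary_search(sorted_items, target, low, mid - 1)


scores = [3, 8, 15, 23, 42, 57]  # made-up, already sorted
print(binary_search(scores, 23))
print(binary_search(scores, 10))
```

Each call halves the part of the list that still needs to be searched, which is why binary search gets by with far fewer comparisons than checking every element in turn.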
Next, the course sequence has you dive into a few advanced topics in data science. Namely, you take courses on Kaggle Fundamentals, Exploring Topics In Data Science, and Natural Language Processing. Each of these is a great course for further understanding what all of this Data Science material can do for you. Kaggle, for example, is a site and community where you can compete with peers in industry and academia to solve machine learning problems. The entries with the highest predictive accuracy win, and winning can lead to employment opportunities or very lucrative rewards. You finish these courses with some great platforms and use cases that you can use to continue honing your skills.
The last section of the Data Scientist in Python path is on working with large datasets, through a course called “Spark and Map-Reduce.” This course introduces the ever-expanding world of distributed processing. This essentially allows your computer (or a set of computers) to split the data into chunks, process each chunk separately, and then combine/aggregate the results. This makes very large (memory-intensive) processing tasks more manageable for your machine. For those who are completing tasks such as training neural networks and the like, being equipped with the knowledge to leverage distributed processing through tools like Spark is extremely important.
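The split, process and combine idea can be sketched in plain Python. To be clear, this is my own toy illustration and not Spark's API: it runs sequentially on one machine, the helper names and sample sentences are made up, and a real framework would hand each chunk to a separate worker.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(records, chunk_size):
    """Split a list of records into consecutive chunks of at most chunk_size."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_chunk(lines):
    """Map step: count the words in one chunk, independently of the others."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial word counts into one."""
    return left + right


# Made-up input standing in for a dataset too large to process in one go.
lines = [
    "spark splits work into chunks",
    "each chunk is processed separately",
    "results from each chunk are combined",
]

chunks = split_into_chunks(lines, chunk_size=2)
partial_counts = [map_chunk(chunk) for chunk in chunks]  # could run in parallel
total = reduce(reduce_counts, partial_counts, Counter())

print(total["chunk"], total["each"])
```

Because each chunk is counted without looking at any other chunk, the map step can be spread across as many workers as you have, and only the small partial results need to be brought together at the end.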
Overall, this was an incredible learning journey. I learned more information than I know what to do with, and had the opportunity to reinforce certain concepts again and again. The biggest challenge, I feel, is that the more advanced topics are seemingly glossed over. This would present an obstacle if you were interviewing for a data science role and had to rattle off knowledge like how gradient descent works, or how to run a Python script from the command line. To be fair though, I think the level of depth was about right in terms of keeping the learner (me/you) engaged. I did find it challenging at the end to keep the same level of enthusiasm for the content as days turned to weeks and then months of similarly formatted learning content. Going through another, similar course sequence like DataCamp would be a great way to reinforce these concepts, which I intend to do.
Please feel free to reach out with any questions or comments, and best of luck on your learning journey.