Intro to Data Science: What is Data Science?
By Steve Brunton
Summary
Topics Covered
- Don't End Up Data Creek Without a Paddle
- Data Science Is About People, Not Data
- Break Down the Data Science Firewall
Full Transcript
Welcome back. We're talking about data science, this kind of intro overview of data science, and I've really emphasized that this is about asking questions, questions that you can answer with data, and about this being a feedback cycle.
We're going to start to talk about what questions you can ask and get into a little bit more detail soon. I want to talk a little bit about the data for now, the various things we'll do with data.
A lot of data science goes into
collecting the data, storing the data, cleaning the data, and processing the data.
Collecting, curating, storing, and cleaning the data are huge aspects of data science: a lot of database engineering and management, a lot of data processing algorithms, cleaning data, identifying outliers and filling them in, and deciding what to do if you have missing data.
Real-world data is messy, and so the collecting, curating, and cleaning of data is absolutely critical.
I should probably also say that part of this feedback is identifying the right data to start answering these questions. You have to identify: do you have the data, can you get the data? And then you start collecting, curating, and cleaning that data.
Okay, so this is kind of database
engineering. What you want to do downstream of that are things like visualization. You want to visualize, you want to analyze, and you might want to build models for prediction, models that can be used and deployed in the real world.
This is where machine learning comes in. Machine learning is super exciting, with lots of interest in developing these models from data, but you have to go through all of this hard work first.
Tons and tons of effort go in here to get to what, for me, is the fun part: visualizing, analyzing, and modeling. This is interesting and fun too, but it kind of goes in this direction.
There's feedback at every step, though. Once you visualize the data you've been collecting, you might decide that you need to modify your data collection or the way that you're storing it, or to ask different questions. Once you start analyzing and modeling, again, there might be feedback loops at all levels of this process.
This is not a static architecture; this is a dynamic process, again in the hands of expert humans,
teams of humans asking questions that can be answered with data. I think of data science as data-driven inquiry, or the science of how to do that data-driven inquiry.
And I'll be blunt: a lot of times when I talk to companies, for example in consulting, they spend a majority of their time and effort on this side, in this kind of data engineering.
I've heard all kinds of analogies and stories: you're going to build a data lake, and you're going to have all these data rivers going into the data lake, this big warehouse where you're going to hold all of your data.
I must caution you: if you don't invest in the analysis, the modeling, and the visualization, you're going to find yourself up data creek without a data paddle.
So you need to do this, this is critical, but you also need to be investing in the analytics and the modeling and these rich things you can do with your data.
And I think visualization is so important because
we're inherently visual beings, and visualization is also all about communication. If you have a great idea but you can't communicate it to your boss or to your team, it wasn't that great of an idea.
So visualization is also a critical aspect of how this feedback cycle works.
And that's another point I want to make: data science is not about the data, it's not about the algorithms, it's about the people and the team and the problems you're going to solve.
It's the questions you're asking, it's the problems you want to solve. It's problem solving, and it's an inherently collaborative art. So
data science involves teams of people collaborating, teams of experts.
In the traditional model of an expert, we call these T-shaped people: you have some breadth, not just one thing but lots of things, and you have a lot of depth in one area.
Maybe this depth is in designing streamlined, aerodynamic automotive bodies, or maybe it is in aircraft design or space mission design, or whatever it is that you do. For me, it's fluid flow control and fluid modeling.
But increasingly, what we're going to find in the future with data scientists on problem-solving teams is that people will need to develop a second depth area in data science.
This is what people call pi-shaped people, pi-shaped experts. You still need your breadth and your domain expertise,
but now you also have this data science expertise.
So I really want to emphasize: data scientists are not just going to be sitting across a firewall from the people collecting the data and solving the problems. You want data scientists on your team, in the room, helping you make decisions.
In fact, ideally you want the experts who are collecting the data and doing the problem solving to have data science expertise.
This is going to be like computer literacy. You don't just have everybody who solves problems on one side and, across a firewall, only a few people who understand how computers work. You have your team of expert problem solvers, they all have a basic degree of computer literacy, and then you have experts who are on the team.
That's how we see data science evolving: as a core functionality that you're going to have to integrate into your whole team.
Okay, other things that are really important are these aspects of reproducibility.
In science, there has been a crisis: lots of results have been published that other groups of scientists can't reproduce when they try. That means there's something wrong in the way that we are conducting and conveying scientific data.
So this idea of reproducibility is really ubiquitous. We want the processes that we spend our lives developing and improving to be reproducible by others. We want reliability in the things we learn and the things we optimize and design.
So reproducibility is also critical in data science: how do you collect enough data that your processes are reproducible, that you have standards in some sense?
Okay, so lots of things to think about. It's about this feedback. You have to do a lot of work collecting, curating, and cleaning your data, but then you get to analyze and visualize and communicate; you get to solve problems with data on teams.
And hopefully your teams of experts are not segregated into domain experts and data scientists, but instead you have people with this mixed capability of domain expertise and data science expertise.
Okay, we're going to keep going into more depth and talk about the various aspects of this in more detail. Thank you.
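The cleaning step mentioned in the talk (identifying outliers and dealing with missing data) can be illustrated with a small sketch. This is not from the lecture: the function name `clean_series`, the sample readings, the median/MAD outlier rule, and the threshold of 3.5 are all assumptions chosen for illustration, and real projects would pick a rule that suits their data.

```python
from statistics import median


def clean_series(values, threshold=3.5):
    """Fill missing values (None) and outliers with the median of the inliers.

    Assumption: outliers are flagged with the modified z-score, which is
    built from the median absolute deviation (MAD). This is one common
    rule among many, not the method used in the lecture.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to learn from, return unchanged

    med = median(observed)
    mad = median(abs(v - med) for v in observed)

    def is_outlier(v):
        if mad == 0:
            return False  # no spread in the data, so flag nothing
        return abs(0.6745 * (v - med) / mad) > threshold

    inliers = [v for v in observed if not is_outlier(v)]
    fill = median(inliers)
    return [fill if v is None or is_outlier(v) else v for v in values]


# Hypothetical sensor readings with two gaps and one obvious spike.
readings = [10.1, 9.8, None, 10.3, 250.0, 9.9, None, 10.0]
cleaned = clean_series(readings)
print(cleaned)
```

The median and MAD are used here instead of the mean and standard deviation because a single large spike would otherwise distort the very statistics used to detect it.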