
Intro to Data Science: What is Data Science?

By Steve Brunton

Summary

Topics Covered

  • Don't End Up Data Creek Without a Paddle
  • Data Science Is About People, Not Data
  • Break Down the Data Science Firewall

Full Transcript

Welcome back. We're talking about data science, in this intro overview, and I've really emphasized that this is about asking questions that you can answer with data, and that this is a feedback cycle. We're going to start talking about what questions you can ask, and get into more detail soon. For now, I want to talk a little bit about the data and the various things we do with it.

A lot of data science goes into collecting, curating, storing, cleaning, and processing the data. These are huge aspects of data science: a lot of database engineering and management, and a lot of data processing algorithms for cleaning data, identifying outliers and filling them in, and handling missing data. Real-world data is messy, so the collecting, curating, and cleaning of data is absolutely critical. I should probably also say that part of this feedback is identifying the right data to start answering these questions.
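As a rough illustration of the cleaning step described above (flagging outliers and filling in missing values), here is a minimal Python sketch. The lecture does not name a method, so everything here is an assumption: the function `clean_series` is hypothetical, missing values are represented as `None`, outliers are defined by the common 1.5 × IQR rule, and both are filled with the median of the remaining points.

```python
import statistics


def clean_series(values, k=1.5):
    """Fill missing entries and outliers with the median of the inliers.

    Assumptions (not from the lecture): missing values are None, and an
    outlier is any point outside [Q1 - k*IQR, Q3 + k*IQR].
    Returns the cleaned list and a parallel list of flags.
    """
    observed = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(observed, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr

    # The fill value is computed from points that are neither missing nor outliers.
    inliers = [v for v in observed if low <= v <= high]
    fill = statistics.median(inliers)

    cleaned, flags = [], []
    for v in values:
        if v is None:
            cleaned.append(fill)
            flags.append("missing")
        elif v < low or v > high:
            cleaned.append(fill)
            flags.append("outlier")
        else:
            cleaned.append(v)
            flags.append("ok")
    return cleaned, flags


# Hypothetical sensor readings with two gaps and one obvious spike.
readings = [20.1, 19.8, None, 20.4, 95.0, 20.0, None, 19.9]
cleaned, flags = clean_series(readings)
print(cleaned)
print(flags)
```

Returning the flags alongside the filled values keeps a record of which entries were altered, which matters when the cleaned data feeds later analysis.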

You have to identify: do you have the data? Can you get the data? Then you start collecting, curating, and cleaning that data. This is the database engineering side. Downstream of that, you want to visualize, you want to analyze, and you might want to build models for prediction, models that can be used and deployed in the real world. This is where machine learning comes in. Machine learning is super exciting, and there is lots of interest in developing these models from data.

You have to go through all of this hard work first, though. Tons and tons of effort goes in here to get to what, for me, is the fun part: visualizing, analyzing, and modeling. The data engineering is interesting and fun too, but the work flows in this direction. But there is feedback at every step. Once you visualize the data you've been collecting, you might decide that you need to modify your data collection, or the way that you're storing it, or to ask different questions. Once you start analyzing and modeling, again there might be feedback loops at all levels of this process. This is not a static architecture. It is a dynamic process, again in the hands of expert humans, teams of humans, asking questions that can be answered with data. So I think of data science as data-driven inquiry, or the science of how to do that data-driven inquiry.

I'll be blunt. A lot of times when I talk to companies, for example in consulting, they spend a majority of their time and effort on this side, on data engineering. I've heard all kinds of analogies and stories: you're going to build a data lake, and you're going to have all these data rivers going into the data lake, this big warehouse where you're going to hold all of your data. I must caution you: if you don't invest in the analysis, the modeling, and the visualization, you're going to find yourself up data creek without a data paddle. You need to do the data engineering, and it is critical, but you also need to be investing in the analytics, the modeling, and the rich things you can do with your data.

I think visualization is so important because we're inherently visual beings, and visualization is also all about communication. If you have a great idea but you can't communicate it to your boss or to your team, it wasn't that great of an idea. So visualization is also a critical aspect of how this feedback cycle works.

That's another point I want to make: data science is not about the data, and it's not about the algorithms. It's about the people and the teams and the problems you're going to solve. It's the questions you're asking and the problems you want to solve. It's problem solving, and it's an inherently collaborative art.

Data science involves teams of people collaborating, teams of experts. In the traditional model of an expert, we call these T-shaped people: you have some breadth, not just one thing but lots of things, and you have a lot of depth in one area.

Maybe this depth is in designing streamlined aerodynamic automotive bodies, or maybe it is in aircraft design or space mission design, or whatever it is that you're interested in and that you do. For me, it's fluid flow control and fluid modeling. Increasingly, though, what we're going to find in the future on these problem-solving teams with data scientists is that people will need to develop a second depth area, in data science. These are what people call pi-shaped (π-shaped) people, pi-shaped experts: you still need your breadth and your domain expertise, but now you also have data science expertise.

So I really want to emphasize that data scientists are not just going to be sitting across a firewall from the people collecting the data and solving the problems. You want data scientists on your team, in the room, helping you make decisions. In fact, ideally you want the experts who are collecting the data and doing the problem solving to have data science expertise. This is going to be like computer literacy. You don't have everybody who solves problems on one side, a few people who understand how computers work on the other, and a firewall between them. You have your team of expert problem solvers, they all have a basic degree of computer literacy, and then you have some experts who are on the team. That's how we see data science evolving: as a core functionality that you're going to have to integrate into your whole team.

Other things that are really important are these aspects of reproducibility.

In science there has been a reproducibility crisis: for lots of results that have been published, when other groups of scientists try to reproduce them, they can't. That means there's something wrong in the way that we are conducting science and conveying scientific data, and so this idea of reproducibility is really ubiquitous. We want the processes that we spend our lives developing and improving to be reproducible by others. We want reliability in the things we learn and the things we optimize and design. So reproducibility is also critical in data science: how do you collect enough data that your processes are reproducible, that you have standards?
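One narrow, concrete reading of "enough data" is that an estimate should come out about the same when the experiment is repeated. The sketch below, which is an illustration and not something shown in the lecture, simulates repeated experiments with Python's standard library and measures how much the sample mean varies from run to run at two sample sizes. The function name `spread_of_estimates`, the Gaussian data with mean 10 and standard deviation 2, and the fixed seed are all assumptions chosen for the example.

```python
import random
import statistics


def spread_of_estimates(sample_size, repeats=200, seed=0):
    """Standard deviation of the sample mean across repeated simulated experiments.

    Hypothetical setup: each experiment draws `sample_size` Gaussian
    measurements (mean 10, standard deviation 2) and reports their mean.
    """
    rng = random.Random(seed)  # fixed seed so the simulation itself is repeatable
    means = []
    for _ in range(repeats):
        sample = [rng.gauss(10.0, 2.0) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return statistics.stdev(means)


small = spread_of_estimates(sample_size=10)
large = spread_of_estimates(sample_size=1000)
print(f"run-to-run spread with 10 points per experiment:   {small:.3f}")
print(f"run-to-run spread with 1000 points per experiment: {large:.3f}")
```

With more data per experiment the run-to-run spread shrinks, and the fixed seed makes the simulation itself repeatable.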

OK, so there are lots of things to think about. It's about this feedback: you have to do a lot of work collecting, curating, and cleaning your data, but then you get to analyze, visualize, and communicate. You get to solve problems with data on teams, and hopefully your teams of experts are not segregated into domain experts and data scientists; instead you have people with a mixed capability of domain expertise and data science expertise. We're going to keep going into more depth and talk about the various aspects of this in more detail. Thank you.
