Meet The Data Scientists: Germany and USA - The Ecosystems of Data Science for the Common Good
By Hertie School Data Science Lab
Summary
Topics Covered
- Data Science Expands Established Policy Tools
- Master Causal Inference Despite ML Hype
- Machine Learning Targets Prediction Not Causation
- Specialize in Policy Domain Over General Skills
Full Transcript
Thank you, everyone, for joining us, and greetings. I'm from the Hertie School Data Science Lab; welcome to our Meet the Data Scientists event series. I'm just going to admit a few more people into the call. This is a series of talks on different topics from the life, work, and research of contemporary data scientists. It's meant to give us a better understanding of where the industry is heading, some of the skills needed to be successful in this field, and some of the challenges facing data professionals in our technology-driven societies today.

We are very excited to have Alex Engler as our speaker for today's event. Alex is currently visiting Berlin as a researcher, a Fulbright-Schuman innovation scholar, and a think tank senior fellow, researching the overlap of data science and governance in Germany. He previously worked as a senior data scientist at different organizations in the US, helped to run the Master of Science in Computational Analysis and Public Policy at UChicago, and helped to design the Master's in Data Science for Public Policy at Georgetown. Today he will be talking about the ecosystems in Germany and the US that support data science for the common good. He will deliver his talk, and afterwards there will be some time for Q&A with the audience. So, Alex, the floor is all yours.
Thanks so much for the intro. Hi, everybody, I'm excited to be here. I've been involved in programs like this for a while, and I think the thing I'm most proud of is the marginal impact of helping people who are passionate about data and policy to do good. In fact, I care about that much more than I care about any specific thing I'm going to talk about, and I mean that very specifically: you're a relatively small group on this call, so if I say something that is interesting and valuable to you, I welcome you to interrupt at any point. Just unmute yourself and talk, and I will be happy to go down a little rabbit hole with you in that direction, to the extent that it's generally valuable. That is just more important to me. So that was my first thought.

I'm going to share some slides, and I should apologize a little, because this probably isn't as much about the German ecosystem and the US ecosystem as I originally thought it was going to be, and a little more about the data science and policy analysis ecosystem broadly, because that's easier to tie to things I think are valuable to you all, and specifically to what you do with your career. So I have two parts: one, what's going on in policy research, and a little bit in government as well; and two, why that matters for you as students. Hopefully I will draw a clear line between this sort of overview and really specific things you can do in your career that will make you more competitive for jobs. That's my goal. That being said, again, feel free to jump in and say, hey, this is interesting, tell me more about this.

I just got a great intro, so I'll just mention that I've been doing this stuff for a while, as a data scientist in government and in think tanks, as well as teaching it at Georgetown, UChicago, and Johns Hopkins. These are the programs I've worked with most: UChicago, which named its program the Master of Science in Computational Analysis and Public Policy because it launched right before the term data science got popularized, and then, at Georgetown, the more recent Master's in Data Science for Public Policy, which is the same name as the program at Hertie.

Alex, I'm sorry, can you make the slide full screen?

Yeah, let's do that. Great, can everyone see that okay?

Yes, looks good, thanks.

Okay. So, a canonical point about data science, in policy and otherwise: it is not new. Alice Rivlin, who ran both the United States' Congressional Budget Office and Office of Management and Budget, two big empirical agencies in the government, wrote a book in 1971 called Systematic Thinking for Social Action, and this is really what it's about; this is the last sentence: "Put more simply, to do better, we must have a way of distinguishing better from worse." She's talking about the methods and tools they were using at that point. I want to tie this really specifically to my talk: when we talk about data science in the context of public policy, it is absolutely a mistake to say all of this is new, that this is a brand-new set of tools. I think it's actually valuable to have some of the history in your head, some of "this is new and valuable in this new way" and "this has been around a long time and operates in its own slot of policy analysis." Most of the new things have not replaced the old things. In fact, we have new things doing new things and old things doing old things, and the overlap is still very, very muddled; the way that, for instance, machine learning is changing program evaluation so far is not very much, just a little bit at the margins.

Okay, more tangibly, this means I think it's valuable to have a definition, a working understanding, of data science as an expansion. The term really means we've started to do more things; it doesn't mean we just started something. It means an expansion of what falls into our use of data. There are three things I bolded here in the top left that have been around for a long time: experiments, causal inference, and microsimulations. I would also add descriptive statistics and survey methods. I would say all of these are somewhere in the realm of 40 to 70 years old at least, descriptive statistics much older than that. So this older work is really part of what we're talking about when we say data science, even though we often sort of mean, oh, well, there's new stuff; we're being all-encompassing.
And here's a bunch of examples. So what I'm going to do is talk about some new and some old data science at policy institutions, and then zoom back out to why this matters for you all. I can't see your faces, which is okay; if you want to share your video, you're welcome to. The other thing is that I don't know your backgrounds well, so there are parts of this that may be completely known to you, and you should feel free to chime in and say so, and there are parts that may be totally alien, and I welcome you to drop a note in the chat or raise your hand and say, okay, can you explain a little more. I'm guessing a bit at where people are in their data science education; there's going to be some terminology, and I'll try not to lose anyone with it.
Okay. Experiments, just really quickly: experiments are still really valuable. They are still the gold standard of policy information; we care whether public policy works. A really famous one from the United States: the state of Oregon decided to give out more Medicaid coverage, which is free health care for low-income people. They realized they did not have enough money to do what they originally intended, so they said, oh crap, we'll do a lottery, and we'll give it to some people and not to others, randomly. This is a slightly insane thing for a state to do and does not happen very often, but social scientists realized, hey, we can track the outcomes of the people who randomly got Medicaid versus those who applied but didn't get it, and see how it helped and how it didn't. This is an actual randomized experiment, though many of you are probably familiar with quasi-experiments, things that feel experimental but aren't quite, which also qualify here. They followed these people over time, and it told us important things: that Medicaid does improve health status, for instance with levels of cholesterol and with diabetes later on, but that it doesn't actually make health care cheaper; it didn't reduce emergency services enough that things got cheaper in that sense. Maybe this doesn't matter here, because everyone has health care, but it was a pretty big deal, and it ended up being an incredibly impactful study. Experimentation is still hard, because it's expensive and often unethical, but it's still the gold standard. It hasn't gone anywhere, and it's part of data science.

Causal inference, especially around program evaluation: this is probably, to this day, the most common and most important type of data science. It is the majority of empirical evidence in policy, and I think even if you want to be a data scientist who is attracted to natural language processing and web data and machine learning, you have to understand this stuff. Typically you have data that was not meant for whatever you're studying, but you find it, you observe changes over time, and you try to parse causal inference: did a specific policy change or intervention cause something else, a rise in income, an improvement in health outcomes, whatever it might be. The reason I mention this so explicitly is because, as I said before, we have a whole bunch of cool new data science things, and really none of them answer this question. You shouldn't confuse the fact that we have all these new methods with the idea that we have fundamentally changed our process of causal inference, where we care about whether a change, in policy analysis specifically a policy change, caused something new. Now, a little treat for those who are super interested: if you follow the work of Susan Athey, a very famous economist in the States, she is starting to push the boundaries of where these two things overlap, as are some other people. But most of the time, the vast majority of causal inference has so far not been substantially changed by data science, and those efforts remain new.

Microsimulations. There's no way to run a quick poll, so if you're listening and you know what microsimulations are, can you type the letter y into the chat, or just "yes"? I think answering affirmatively is fine: how many people here understand what microsimulations are? I'll wait. Nobody, and I see some no's in the chat, that's helpful, and one yes. All right, cool. I would actually expect the answer to be close to no one, so that's not a critique of you all; it's just not a thing we talk about much. Microsimulations are rare, they are expensive, and they're incredibly valuable; we put a lot of weight on them in the areas where they're used. You can think about a microsimulation as a program. It is not often very statistical; it is usually a very complex, rules-based program that models something very complicated. Typically we use it with money, where a specific change has a guaranteed outcome on money, so we can change something, model what happens when we move that thing around, and see some effect. In the US, benefits policy; in almost all countries, tax policy. There are other things you can use microsimulations for, but those are the big ones. Often they're included in budgets too, and in the budget outcomes of legislation, and sometimes in more specific things like behavior; we do have health insurance models, and the cost of social security is often a microsimulation. The word "micro" means they work on microdata, individual row-level administrative data, and the "simulation" is that it's a guess: if we change this piece of tax policy, how much money do we take in, how much money do we lose?
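To make that concrete, here is a minimal sketch of the rules-based idea in Python. Everything in it, the records, brackets, and rates, is invented for illustration; real models apply thousands of such rules to administrative microdata.

```python
# Minimal sketch of a rules-based microsimulation (illustrative only).
# The records, bracket threshold, and rates below are made up.

RECORDS = [
    {"id": 1, "income": 25_000},
    {"id": 2, "income": 60_000},
    {"id": 3, "income": 250_000},
]

def tax_owed(income, top_rate):
    """Toy two-bracket tax rule: 10% up to 100k, top_rate above that."""
    base = min(income, 100_000) * 0.10
    extra = max(income - 100_000, 0) * top_rate
    return base + extra

def total_revenue(records, top_rate):
    """Apply the rule to every microdata row and sum the result."""
    return sum(tax_owed(r["income"], top_rate) for r in records)

baseline = total_revenue(RECORDS, top_rate=0.30)
reform = total_revenue(RECORDS, top_rate=0.35)  # raise the top rate by 5 points
print(f"Baseline revenue: {baseline:,.0f}")
print(f"Reform revenue:   {reform:,.0f}")
print(f"Revenue change:   {reform - baseline:,.0f}")
```

The point of the sketch is the shape, not the numbers: a parameterized, deterministic function run over row-level records, where changing one parameter and re-running the whole thing is the "simulation."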
This is also a really core part of data science. Every time a US president has a tax plan, it goes through some microsimulation model, and it gets wall-to-wall coverage in our press: Bernie Sanders's tax plan would cost this much money, Hillary Clinton's tax plan would cut this much, Trump's would do XYZ. These are really important, even though you probably won't read much about them. They are very old; the original microsimulations were built on punch cards, at the very beginning of computing. They are often five or eight or ten thousand lines of code, and they're not learning processes; they are rules-based: if you change this tax policy, this person gets this much less or more money, and that has this effect overall on the national government.

So this is actually a huge amount of data work, and I even mentioned survey methods, which totally count here too. I feel like we should have all of this in our heads when we start talking about the rest of data science, the stuff that gets our attention, that has led in some ways to the creation of programs like yours and the ones at Georgetown and UChicago and elsewhere, and that has brought this renewed attention. It's important to recognize what's been here for a long time. Happy to stop there for just a second; any questions on that? I just threw broad concepts at you, maybe something you've never heard of. Anyone want to chime in and be like, that was dumb, or, that was useful and I have a question?

Yeah, Alex, there are some in the chat. Some can stay for later, and one is directly connected to this: are microsimulations similar to A/B testing?

Okay, that's a really important point. A/B testing is actually analogous to experimentation. In A/B testing you give two different treatments, generally randomly, and you see how things change. A really common example of A/B testing would be: I am running a political campaign and trying to get donations, so I send some people one email and other people another email, and then I look at which of the two emails got me more money from these donors, and maybe I repeat the process next time. With a microsimulation, the stakes are way too big to do that. The stakes here are how much people pay in taxes; you cannot randomize that at any meaningful scale. It's an enormous question with huge consequences, you might update your tax laws every 15 years or so, it's too big a change, and you can't do it randomly, so it's entirely data-driven. That's the difference between a microsimulation and A/B testing.

Programming languages for microsimulation: these models are written slowly, so they're often a generation behind. You'll find some written in Fortran, some in C++, some in COBOL these days, and hopefully the current generation is being written in R and Python and Julia, which will be outdated by the time they're done. My god, Fortran. Say what you want about Fortran, it's fast; your linear algebra libraries are written in Fortran, so someone's still using it. I'm going to come back to the administrative data question; I think that's a little too specific at the moment, but I'm happy to return to it. And I see someone has their hand raised; do you want to chime in? Just a question.
I come from a social science background, so I'm not exactly familiar with the technical language, but is this a software, some kind of digital environment where you put in inputs and observe the results, or is it a statistical method? What is a microsimulation itself?

Yeah, so you can just think of it as a program, a parameterized program, a really, really complicated function. The tax policy model in the US has 10,000 parameters, 10,000 options you can change before you run it. One really simple example: I'm going to take the amount of money that very high-income people make and raise it by five percent; then you run the whole model again and see how it changes the income distribution and how much money comes in. So it's really a function. As an individual user you see it as a function with many, many options, but what it's really running is a simulation of the entirety of a nation's tax policy.

So these are quite specific? A microsimulation can only be designed for one country, one government, one system at a time; you cannot simulate results on different entities, I guess?

Yeah. There are people who've done it at the state level in the US, but even that was hard, and you still end up with essentially one model for each state; some of the code is reusable, but yeah, one at a time. Good question.

There's one more question, from Ajit: do you also capture mental maps of individuals for microsimulations?

Mental maps of individuals... hold on, in my head I'm interpreting that as behavioral changes.

Yes, behavioral changes.

Great. So you can imagine encoding the idea that, as I just said, if you raise the marginal tax rate for wealthy people by five percent, you shouldn't necessarily assume that they work exactly the same amount or do the same thing with their income. You can imagine instead that if you did that enough, they would either work less or, as we've seen from wealthy people, put their income into a different form, to get more of it through capital gains or business expenses. That is a big, long-running conversation in the US. We have slowly started to implement behavioral changes, but the major answer is no: there are very few behavioral adjustments in these models. And if you're saying, wait, that makes them bad, you're completely right. If you make big enough changes in a microsimulation model that you would foresee big changes in behavior, the models don't work anymore. In fact, the models have many flaws. The reason they're valuable is that when you're talking about changes at the scale of messing with tax policy, the alternative to these models is guessing, and guessing is definitely worse. So these models are a great example of "all models are wrong, but some are useful": their intent is to give you a rough sense of change, better than a human guess, at which I think they succeed, but we haven't been able to make them much better with behavioral modeling, only a little bit. It's a really good question, and if you follow up, I'm happy to send some resources on thinking about this.

Okay, cool. I'm happy to stop for more questions at any time, but that's a little wrap-up of what we've been doing. After this I have something like 30 examples of modern data science; that's too many, so I'm just going to say about two sentences on each one, and I'm happy to come back to whichever are useful and point you to resources where you can find more.

The big one I'll start with: machine learning. I love this example from Chicago. They realized they could predict the likelihood of exposed lead paint in the walls of houses, and with that you can send inspectors and prevent young children from getting sick from lead paint chips. This is not a causal question. I don't care why there is lead paint; couldn't care less, does not matter to me. It was painted back when we used lead; who cares, that's not the problem. The problem is: how do we prevent young kids from getting lead poisoning? And what you can do is use all sorts of data: data about when the house was last renovated, data about how many rats are typically seen in the street. It doesn't matter; the point is that this stuff correlates with houses that are maybe a little more derelict, haven't been maintained as well, and have some exposed lead paint from back when. Then you can send out inspectors. This dramatically raises the inspectors' success, the likelihood of finding a dangerous home, and you can even time it, based on hospital data, to when there are young children in that house. This is a different paradigm of thinking with data, a relatively new one. It also uses data not really meant for this, but to answer a different question: to improve a government service. This has been slowly growing in the US.
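The pattern here, score and prioritize rather than explain, can be sketched in a few lines. The homes, features, and weights below are invented; in a real system, the weights would come from a model fit on past inspection outcomes rather than be set by hand.

```python
# Sketch of the "prediction policy" pattern: score homes by risk, inspect the top k.
# Features and weights are invented stand-ins for a fitted model.

homes = [
    {"id": "h1", "years_since_renovation": 40, "rat_complaints": 7, "built_pre_1978": 1},
    {"id": "h2", "years_since_renovation": 3,  "rat_complaints": 0, "built_pre_1978": 0},
    {"id": "h3", "years_since_renovation": 25, "rat_complaints": 2, "built_pre_1978": 1},
]

# Hypothetical coefficients; a real model would learn these from labeled data.
WEIGHTS = {"years_since_renovation": 0.02, "rat_complaints": 0.05, "built_pre_1978": 0.5}

def risk_score(home):
    """Linear risk score: higher means more likely to have exposed lead paint."""
    return sum(WEIGHTS[f] * home[f] for f in WEIGHTS)

# Rank all homes and send inspectors to the k highest-risk ones.
ranked = sorted(homes, key=risk_score, reverse=True)
to_inspect = [h["id"] for h in ranked[:2]]
print("Inspect first:", to_inspect)
```

Note what is absent: no causal claim about why any feature matters, only a ranking that makes each inspection visit more likely to find a dangerous home.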
You can look at a hundred-plus cases in a report called Government by Algorithm, a collaboration between Stanford and one of our government agencies. And then in Germany there's a smaller report, which I think is still very interesting for some of you, from Fraunhofer: KI in der öffentlichen Verwaltung. I started using German, what a terrible mistake, glad we're recording this; it translates to AI in public administration. Anyhow, this is an example of a totally expanded use of data science, very much new. We've sort of known we could do this for a while, but now we're seeing these really proliferate and start to fundamentally affect government services. This is a big deal. You shouldn't confuse it with all of data science, but it really matters, and I'll talk a little later about how this plays out in the German context, where there's sort of systemic buy-in to it.

Another example: the first one was supervised machine learning, and this is unsupervised machine learning, also changing how we think about policy analysis. Pew Research, a survey methods group in the US that has a data lab doing good work, decided that the idea of left and right, of liberal and conservative, did not fully represent the diversity of political views. So they surveyed a bunch of people and then did a clustering analysis. Clustering is the idea that, with no specific outcome, you can find latent or hidden groups of people. They might not self-identify this way, but you can find groups that describe their behavior, and there is a series of methods that do this type of thing.
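As a toy illustration of what clustering does, here is a minimal k-means on made-up two-question survey responses. There is no outcome variable anywhere; the algorithm just looks for groups.

```python
import math

# Toy latent-group finding: each (invented) respondent answered two 0-10
# policy questions, and k-means searches for clusters with no outcome variable.

responses = [(1, 2), (2, 1), (1, 1), (8, 9), (9, 8), (9, 9)]

def kmeans(points, k, iters=10):
    # Naive deterministic initialization: spread starting centers across the data.
    centers = [points[i * (len(points) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:  # assign each point to its nearest center
            groups[min(range(k), key=lambda i: math.dist(p, centers[i]))].append(p)
        for i, pts in enumerate(groups):  # move each center to its group's mean
            if pts:
                centers[i] = tuple(sum(c) / len(pts) for c in zip(*pts))
    return centers, groups

centers, groups = kmeans(responses, k=2)
print("Cluster sizes:", [len(g) for g in groups])
print("Cluster centers:", centers)
```

Real political typologies use richer data and more careful methods, but the core move is the same: let the structure of the responses, not a predefined left-right label, define the groups.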
Unsupervised learning is generally the umbrella category, and this is relatively new as well; it has not had a particularly large place in policy research and political science until somewhat recently. Again, I'm blazing through some of the history here, but it's relatively new: lots of data, more opportunity to find some of these latent groups. And so Pew's data lab pushed the boundaries of how we describe politics.

This next one is just a big data example. It's a very pretty chart from the Urban Institute, where I used to work, but the actual story I want to tell, because I'll come back to visualization in a second, is just about more data. This is about how Americans hold debt. It's a very depressing graphic; almost all of it is real bad. But underlying it is credit-card-level transaction data, and there are two interesting things about that. The pure amount of data that credit agencies have is so large that it requires us to do new things. You cannot run this analysis in R on your own computer; you need parallel programming or distributed computing, the cloud or a high-performance server of some kind. It is just that much data; there is no way around it. People often say "big data" doesn't mean anything, but sometimes it does: sometimes it means you need different software and different hardware, and that's the case here. This was done with Spark, in the cloud, Spark being a distributed computing framework, and that made it go much faster than would otherwise have been possible. I say "we," though I think it was actually the credit agency that ran the analysis, but again, the size of the data required a new framework.
Text is data. There are lots of examples of this, and I actually think you're going to hear about it as students, given that I know your professors work with text. So I'll just mention one thing you can do: you can look at the overlap between types of jobs and patents, as in this example. What they said is, we can find lots of patents in certain areas of artificial intelligence that match job descriptions, and we think AI is more likely to replace a type of job if we find more patents matching that job category. I thought that was an interesting approach to natural language processing. There are flaws in this method, like in everything else, but hey, text is data; we're still experimenting. And this is a big one, especially for anyone interested in political science: text data is really becoming quite prolific.

Network data. Maybe I'll only do one of these; this one is about disinformation. What they did here is look at media outlets in the US and how often they link to one another. The size of each circle is how often an outlet is viewed and seen, and how often outlets link to each other is their relationship. What it shows you over time is that conservative news in the US has just broken away. It's not that the middle has changed or the left has changed; it's that the right has moved further from the center, which matches a lot of other empirical research in the US showing that our Republican Party is moving farther right, so that rather than symmetric partisanship you have radicalization on one side, backed by a lot of evidence, and in this case by network analysis.

New data collection. This is actually a project to measure inflation just with web data. I know I'm going through these fast; sorry, I'm happy to share the slides so people can dig in more. The current way we measure inflation is kind of nuts. We have some data pipelines, but we also actually call a lot of companies and say, how much does your chair cost today? Some of it is randomized; I heard a story once where someone actually went to a vending machine outside a fire station to check how much the sodas were. The way we randomize and collect pricing data is consistent and very valuable, but it's a little dated, and there are efforts to incorporate and expand the amount of web data we use to measure prices. The Billion Prices Project is just one example; it's been doing this for a while, and it exclusively uses online data to measure inflation. I think it is actually, technically, the official inflation metric of Argentina. This stuff probably doesn't totally replace official government statistics, but we'll probably find some way to merge them together.
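As a toy illustration of the web-data approach, a price index built from scraped prices might look like this sketch. The items, prices, and unweighted averaging are all simplifications; real efforts like the Billion Prices Project track millions of products and weight them by expenditure.

```python
# Toy web-price index: prices scraped for the same items on two dates,
# turned into an average of price relatives. All items and prices are invented.

prices_then = {"milk": 1.00, "bread": 2.00, "chair": 40.00}
prices_now  = {"milk": 1.10, "bread": 2.10, "chair": 38.00}

# Price relative per item, then an unweighted arithmetic mean
# (real indexes use expenditure weights rather than a simple mean).
relatives = [prices_now[k] / prices_then[k] for k in prices_then]
index = sum(relatives) / len(relatives)
print(f"Price index vs. baseline: {index:.3f}")  # above 1 means prices rose
```

The hard parts in practice are not this arithmetic but matching the same product over time, handling items that appear and disappear, and weighting, which is exactly where merging web data with official statistics gets interesting.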
And US agencies have been working on that for a while. A German example, and I really like this one: this is from Florian Keusch, who's at the University of Mannheim, working to combine digital trace data, the data you get from the devices that follow you around all day, with administrative data from agencies, sort of like the prices example. But this is a little more about things like mobility. People's mobility has been a big question in social science for a long time: where do they go, how much do they walk, what are their commutes, does their movement and distance correlate with job outcomes, anything like that. We haven't really been able to study that well for a long time; now we kind of can. The downside is that this data, unlike many of the data sources I'm mentioning, is not necessarily representative. You have to have a smartphone, you have to get contacted by these surveys, you have to agree to participate. So the first step, and this is largely the stage we're at right now, is figuring out how to learn from this data in a way that reflects your overall population.
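A common first step toward that is reweighting: scale each respondent so the sample's demographic mix matches the population's. A toy sketch, with all shares and values invented:

```python
# Toy post-stratification weighting: the smartphone sample over-represents
# young people, so weight respondents by (population share / sample share)
# of their group. All shares and mobility values below are invented.

population_share = {"young": 0.30, "old": 0.70}

sample = [
    {"group": "young", "daily_km": 12.0},
    {"group": "young", "daily_km": 10.0},
    {"group": "young", "daily_km": 14.0},
    {"group": "old",   "daily_km": 4.0},
]

n = len(sample)
sample_share = {g: sum(r["group"] == g for r in sample) / n for g in population_share}
weights = {g: population_share[g] / sample_share[g] for g in population_share}

unweighted = sum(r["daily_km"] for r in sample) / n
weighted = sum(weights[r["group"]] * r["daily_km"] for r in sample) / n

print(f"Unweighted mean mobility: {unweighted:.1f} km")  # dominated by the young
print(f"Weighted mean mobility:   {weighted:.1f} km")    # closer to the population
```

Weighting only fixes imbalance on the variables you weight on, which is why representativeness of digital trace data remains an open research question rather than a solved one.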
So again, really interesting. I am absolutely convinced that we will have official statistics that include digital trace data like this; it's just a matter of when, and of how to carefully and consistently institute the data pipelines. Representativeness is one challenge; I'll also mention that the willingness of private companies and individuals to allow access to this data is another big one, for understandable reasons. I'll skip this next one for now.

Satellite data is great. This is another kind of imagery to go alongside the networks and text and web data. There are lots of places in the world that don't have high-resolution wealth data, especially in developing countries, and it turns out you can do a pretty good job of guessing based on satellite imagery. Specifically, in this case, they take satellite imagery of areas where they do have wealth metrics and build a predictive model that says: when the roofs are made of this material, and the road density is this dense, and there are this many cars and this many people, that usually indicates this much income. Then they apply that model to the areas where they don't have income information, building a better-than-nothing estimate of economic welfare. This was piloted, I think, by Stanford, and it's now part of a World Bank project that does it in an ongoing way. Another really cool example is tracking deforestation with satellite data; I think the best deforestation data now is very clearly from satellites. That's a case where it's even easier: is there a tree here, yes or no?
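The train-where-you-have-truth, predict-where-you-don't pattern can be sketched with a one-feature linear fit standing in for the real imagery model. All numbers are invented, and the actual work uses deep models on raw pixels rather than a single hand-picked feature.

```python
# Sketch of the satellite-wealth idea: fit a model where ground-truth income
# exists, then apply it where it doesn't. A one-feature linear regression on
# invented data stands in for a deep model on imagery.

# (road_density, measured_income) pairs for areas where surveys exist.
labeled = [(1.0, 500.0), (2.0, 900.0), (3.0, 1300.0), (4.0, 1700.0)]

# Closed-form simple linear regression: income = a + b * road_density.
n = len(labeled)
mx = sum(x for x, _ in labeled) / n
my = sum(y for _, y in labeled) / n
b = sum((x - mx) * (y - my) for x, y in labeled) / sum((x - mx) ** 2 for x, _ in labeled)
a = my - b * mx

# Apply the fitted model to areas with imagery but no survey data.
unlabeled_density = [1.5, 3.5]
estimates = [a + b * x for x in unlabeled_density]
print("Estimated incomes:", estimates)
```

The estimate is only "better than nothing": it inherits whatever the labeled areas can teach it, which is why these models are validated against held-out survey regions before anyone relies on them.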
um okay cool and the last one i'll mention um is it changing since last one am i a liar second last one i'll mention is the
changing approach to data visualization um i love database i'm gonna take a second to click on one of these show you what i mean um
the idea that graphs are new isn't really a thing i think the percentage of the stream that they're getting is really what's changing and you see this first in journalism and
media but you're seeing it more and more from research institutions where the data is taking a higher percentage of the screen and in fact in many ways as you can see here the graph is
actually staying while the text continues up and what i really love about this this is a prediction about a forecast of how prison populations change under
different policies you can click on different scenarios and the graph that are in the text and then you will see those reflected in the graph the urban institute
where i used to work led by a team of data visualization people um led by again named ben chartoff which is an excellent name to be a data visualization developer uh have done
this really really well and um uh and have this really excellent um eye for
uh storytelling with visualization um so uh where they sort of weave in the graph and the story and so here's another example where you have an individual and they mark this individual with a blue
dot and then as you follow down and go through the graph you can actually see that individual this is about their commuting distance in dc where i normally live and you can watch
that person continue across the graph and then continue on there's some more data following the blue dot and then they're going to switch people to a yellow dot and then follow this person's commute
across the city um and i just think they've gotten a lot better at weaving these stories together we're going to tell a story with data but we don't just want to show you a graph and then show you some text we're going to sort of
build this into a little bit more of an immersive and interactive um feature um and i have some examples uh in this deck that i can share um but more than that
it's uh while data vis isn't new we're seeing this much more interactive and maybe immersive experience of storytelling um and and the last one i'll mention which is uh
in some ways the least exciting but man can it be really important it's actually just data availability um so the example here is from
uh sorry also from the urban institute um the education data explorer um and this is actually just an api um
by which i mean the federal government has a whole bunch of data on education as do the states and their ability to consolidate that data and put it online
has has been not very good um and though the us government is getting better at making apis and making data available um sometimes uh it is just necessary for someone else to do it it's
been a bad enough state for a long enough time uh but the urban institute collected this made a consolidated place to get it um you know cleaned it did it once really thoroughly so everyone can get to it and
then document it and um put this online you know in this talk i mentioned a whole bunch of examples where think tanks and policy research institutions were finding new ways to collect data
it's useful to think um if you're going to invest in this type of work and you're going to build a new data collection or you consolidate data effectively your analysis is one outcome but another useful outcome is
the data itself right if you're going to go through an elaborate process to collect a bunch of data it's actually not such a bad idea to to just make it available in addition to your analysis and both of those things
um can play a meaningful public role from a data science perspective um and the urban data explorer is a really good example of a non-governmental api though in
governments they're also really important um so that is a range of new things i see both in methods and in data collection
maybe data collection being even the bigger change over time and then the method's kind of being driven by that and i can bring this to the german context but let me again stop and see what people want to talk about and what questions you have
yes we have one question from ajit um i think it's also quite related to what you just mentioned regarding the urban institute's education data he wants to
ask if there's any suggestions for open data sources for academic research yeah it depends that's um too broad for me to answer usefully honestly but um within
uh specific areas of academia the amount of public research data is is pretty high um
um so i think within an area i can probably help i'm gonna put my email address up and you're welcome to email me with like i'm looking for data sets on blank or in this field and i will i will give you some um pointers uh google
there's also a google data set search now which actually um is interesting and might kind of work um uh but but broadly i think that's almost almost too
broad yeah i don't think i can usefully say how to broadly find data um but googling is actually the number one answer followed by learning about the research institutions and data sources
in your space and then like foia is awful yeah anyway no i'm sorry that's a little too broad but i'm happy to come back to it any other questions for the moment
on the broader expanded data science space you're welcome to just unmute yourself and chime in and nothing at the moment um alex okay all right i'm happy to come back to this so all right so what does this mean more
specifically in germany um a lot of what i just talked about was policy research but that's gonna be related
germany has just announced a couple interesting things um in in broadly this space um
there are a couple interesting points of uh history before this but the most important is the online access law um because of the large number of
services it's going to digitize um there was already some data for many of these services even if you fill out something on paper it ends up in a database but the amount of additional data you get when you
digitize services gets much higher it gets much easier to change and collect more data you get much more users user interaction data as well and it also leads to more opportunities
to automate processes when you already have this significantly online process um and so this is a big deal and even if you don't consider yourself as someone who wants to do um civic tech like building services
or service development this is going to create an enormously uh large amount of government data for which data scientists will be useful on top of this you have a new federal data strategy that came out in january
2021 and then written into the german part of the eu recovery bill is 240 million for data
labs in the chancellery and in every ministry um it is not exactly clear what these are going to be yet uh but it has to happen
there was a bit of political sleight of hand to make sure that any um governmental transition didn't affect this and so writing it into the eu recovery bill is actually pretty
smart because then it will definitely happen here um and so they're all going to hire chief data scientists and build something it's not exactly clear what but that
expands the data capacity of every ministry and again building on top of the sort of large number of digital services this also falls into the new coalition's
um agreement the first section is titled modern state digital awakening and innovation it has various components but one of them is making services better at the governmental level
um which sort of insinuates digitization and maybe even it uses the word proactive which maybe in some senses you could interpret as a little machine learning a little a
little bringing services to people when they need it and then you get an expanded number of civil society organizations the ones i've seen i'm funded by stiftung mercator they really care about this um
the bertelsmann stiftung and stiftung neue verantwortung just both started data science teams um and correlaid has been doing this for a while and supporting um nonprofit work
with uh data science but there's clearly i think a little bit of a civil society uh awakening and expansion into the sort of applied data science
um space um the other thing i didn't mention on here some of these organizations that i mentioned i mentioned fraunhofer i mentioned ifo the contractors in germany who have been doing
um data work for a while are also expanding their data science teams somewhat analogously with the us and there are headwinds here
um low government digital services use is a big one um strong privacy concerns um i didn't mention
hostility to algorithms specifically but for anyone um who followed what just happened in the netherlands in which a government was partially brought down by a poorly
performing algorithm that probably doesn't bode well for citizen trust in governmental algorithms um this was an algorithmic process that um
sort of punished people unnecessarily in terms of benefits and then a decentralized uh government maybe slows this down
as does uh limited ministry access to data so again on the german developments there's a lot happening here there are real structural barriers um but i think there's a fairly high amount of interest i feel like there's a fairly high amount of
political commitment and so you could see an interesting expansion of the domestic capacity and uh and work done um though it's not like germany is starting to use data for
the first time it's been going on for a while here okay so that is an overview of uh how i see the field how it's sort of evolving this is the
slide i started with um and a little bit about what's in germany i want to show this to you a slightly different way um which is what's established versus
what's emerging and then i'm going to transition from here into um into what this means for you all uh in terms of like getting jobs
um so stuff that's been around for a while proprietary statistical software if you were doing the equivalent of this 10 years ago or 15 years ago you'd be learning stata or sas and not r and
python so we've moved from proprietary software to open source languages and tools um as a general rule you should try to learn open source languages and tools every
chance you get and avoid learning proprietary stuff unfortunately governments and some research labs haven't caught up with this change yet but as individuals you will benefit dramatically from investing in
the open source side r and python being the most important but there are other tools um a big change
is that policy research used to run predominantly on public data by which i mean data created and collected by governments um in some ways government data is not actually getting much better it's
getting incrementally better whereas the amount of private data is getting dramatically better and dramatically larger so learning to work with private data in some fields can be really important
i'm interested in getting credit data from uh credit card companies that's one example um all that satellite data is coming from private satellite companies
um with the very small microsatellite fleets or a lot of it is anyway um and all the text data much of the text data that's coming from the web right
um data visualization was here before uh relatively static charts now we have this much more interactive and immersive um
side of it also much more web driven right so far fewer pdfs far more um websites uh and then in methods i mentioned we
had experimental design causal inference micro simulation and relatively small data sets um now we have machine learning and some cloud compute and sort of bigger data um
but in all the ways that i discussed uh so so pretty big set of changes right and you can see how it's easy to confuse data science with being like
just the stuff on the right but really we we should probably be using it to refer to all of this and just recognizing that we're going through a particularly fast transformation in what these methods are at the moment
okay so i mentioned this a couple times but i actually just want to make sure it's clear that if you go into an interview at a lot of places that have been doing data for a while and you're like hey i'm here to bring you my data science skills they're going to look at
you like you're nuts um and you should be a little a little careful even if you are really awesome at machine learning or really awesome at like text analysis or graphics or whatever it is just be respectful of the fact that there were
probably data people in many of the institutions that you want jobs in who have been doing this work for 30 years and they probably understand your field pretty well like reasonably well and have like valid criticisms of why those
various new methods are not as useful in their circumstances they may be overly skeptical and there may be value that you can add but i would just be careful of that opinion and you see this mistake like a lot
um and so you know the i'm here to replace the old process with a satellite data analysis is like very rarely going to go over super well for you being like i'm excited to contribute and there are
things on the margins that these new methods can do it's like a little more likely to um to get you a successful a successful job it's also important to understand
the old stuff right as i mentioned none of this replaces causal inference so to some extent you still need the old stuff um i always get really impressed when i meet someone who understands
like microsimulation because it shows a little bit of like respect for the existing institutions which you need to know to sort of change stuff so that's the first um here's the second
one sorry this is too much text um i just mentioned way too much stuff for you to become an expert in two years or three or five
that was dramatically too much material for you to become an expert at so how do you approach this problem um i would choose a policy area
and develop depth in that and let that choose your data and methods um i mentioned political science networks and texts are really common
international development sensors and satellite imagery whatever it is you are i think much more likely
to be qualified for employment having an overlap of domain knowledge in a specific policy subject and data science skills that match that
than you are being a good general purpose data scientist this is probably the most important advice i can give you and and in a sense
it's going to screw you at some point anyway it's just too much stuff related to this one of the ways you can learn this is to execute on applied projects with real data
as students every chance you get you should execute on an applied project with real data you should try to do the entire process
from data collection if collection is necessary but if not from data cleaning to analysis to presentation make the whole thing look good and put it on github um you should and i
basically mean never be doing like kaggle competitions or just machine learning competitions or really narrow skill set stuff i think it is almost always a bad idea in your situation when instead you can be
working on real data in your field um and i would also argue that you should be familiar with the important data sets in your area if it's conflict and you don't
understand the acled data set i'm gonna throw you out of the job interview right really understanding the important data sets in your field is as core of a job skill in data science as understanding any individual method in fact it's more
it's probably more important most of the time so um for both of these reasons one because it'll pare down the methods to a useful set and two because understanding the data makes such a difference
i would really let your subject area choose what you learn to some extent for you um finding and reading papers on what the cutting edge in that field is is a
great way to do this um and your resume can reflect all this there's a way to talk about this in a fairly straightforward way um and so there's an example at the bottom
that said you can do this including on class assignments right when you can but certainly in internships or ra-ships i accomplished a specific meaningful outcome
is the first thing i think it should say then the technical skill the sort of technical details and in this
case um i'm trying to take one that's relatively accessible uh
discovered you know spatial disparities in um grocery store access using uh
r's geospatial libraries um building a new spatial equity tool something like that but name a tool a software
package a specific programming language i endorse putting specific libraries on your resume um but pairing it with a policy outcome every sentence you can write like this about a
specific policy area is i think the absolute best thing you can do to advance your likelihood of getting a job in this space um
and this part is a little scary um but the more you can do this you will be better qualified for fewer jobs and that's a good thing
having a ton of experience in just health and data science taking the classes that are in that and learning the methods and knowing the data and doing an internship and doing an ra-ship all in health and data science you
will be qualified for like five jobs but you'll be like the best candidate for those five jobs right and i actually think that's a better and safer place to be in your career early on than being an okay like a pretty good
like the third or fourth best choice for 100 jobs i would rather be in the first situation than the second so the more targeted you can get i think the more qualified you will be to do specific stuff you also have more control over
where you end up um and again learning everything learning learning all of data science is probably a bust um i have a little more on some of the job roles that aren't data
science um but i think this is the thing i wanted to end on this is like the most important piece of advice that i have and it sort of um maybe is helpfully framed by the work i did so again i will stop and take questions and
i'm also happy to talk about something else if you have any questions um you can just raise your hands and directly
talk to the mic or yeah but um i just want to echo what alex mentioned regarding the domain knowledge with data science because um
i think within the lab we also talked to a lot of industries and um a lot of different companies and i think the common theme that we heard a lot from different companies nowadays is that the
name of the game is no longer to be a generalist data scientist right so it's more about becoming a specialist in some field and having the data science skill to back that up
and that has been echoed by a lot of industry experts and different companies um and i think it's a very important point because i think a lot of people get stuck in this loop of learning
so many different methods and so many different skill sets and trying to apply it all to nlp to computer vision and to all different kinds of new and fancy technologies but then at the end of the
day it also depends on what kind of industry you're going for it's important yeah it's too much stuff it's become it's become an incredibly
diverse set of skills i mean there's some stuff that everyone should know right depth in one programming language git and github the command line some understanding of the cloud right causal
inference there's um some probably probably everyone should have some machine learning experience but you guys i know you guys are getting that um but after that it's and even even that far it's sort of like well
how do you how do you weigh trade-offs between the 87 000 technical skills there are i don't know right not without being guided by some sense of the policy area
there's one question from caroline um do you have any advice on how to choose which policy area to focus on yeah that's that's like yeah it's the hardest part um no probably probably the thing you're
most passionate about is a good is a good guess there are some areas in the overlap of data science and policy that are easier than others so the first answer is definitely the
thing you actually care about right like the field that you want to work in is by far and away the best choice there are differences between them there is a lot of health data there is no shortage of
health data sciencey work just the pure amount of health data that we collect and the sort of emerging field of precision medicine and the amount of sensor data makes health and data science
kind of easy in the long run you just know there's going to be a ton going on there in the near term data science and international relations and data science and foreign policy is a dicier bet
it is getting more empirical um in fields like conflict studies international development has been data sciencey for a long time
um and one of the places i mentioned briefly the center for security and emerging technology at georgetown is sort of taking a very very data sciency look
at foreign policy and doing a really good job there but broadly the number of jobs and data science and foreign policy is dicier it's a little it's a little lower
um that could change it feels like it's on a bit of an upswing but um there are some implications there so broadly i would say go with what you're passionate about you might want to prepare a little bit for the fact that
there are differences between the degree of data science focus in various fields alex um i have a question that was based on a slide that you
showed earlier so i think one of the slides that you showed is nowadays there are so many different ways to visualize the data and also tell stories based on the data that you have and the results that you have from the data
and i think you showed a couple of examples on how you can do that and make a very interactive data story and so maybe you could give the audience
some advice on maybe some of the technologies that are needed in order to learn how to create for example a data driven journalism article like that or a dashboard that is
interactive with different people uh there's good news and bad news um the good news is the stack to do that is open source and free and accessible the bad news is that it's hard to learn and
you have to become kind of a front-end web developer um so in my little archetype of like job roles doing data visualization
full-time is its own specialty and you probably would have to just do that um to be to be the kind of person who can really build those consistently um
the guy who invented d3 which is a javascript library for visualization and it's really really good and really impressive and that's why the new
york times hired him um they paired him with a designer named amanda cox and the two of them went on to be the best data vis team in the world for quite some time mike bostock has gone on to do other things
now but the reason that new york times data visualization is so insanely good is they hired the guy who invented the framework um and paired him with uh a truly
excellent designer um and the upside of d3 is that it's incredibly expressive and you can make basically anything you want with it
and there's a billion code examples but again the downside is that it is hard to learn if you really are passionate about data visualization like not a passing interest i would push you towards scott
murray's book interactive data visualization for the web um there's a
second edition maybe a third edition now um read it and make lots of things in d3 it's uh it's an uphill battle but it's some of
the most fun coding once you learn how to do it that's a very specific technical stack and then if you're not going to do that learn r and ggplot and some adobe illustrator you'll make really pretty
static graphs but the the jump to web development is like it's it's a big jump yeah no definitely i very much agree with that uh having done some web development as well
um so um uh alex is it possible that you can share your email address in case any of the audience have some questions later down the line yep i will do that in just a second this
is the last thing i wanted to mention um the two reports on how machine learning and algorithms are being used in government these are really comprehensive the first one for the us the second one for germany
um and i mentioned a bunch of sort of good u.s institutions that are
really expanding how they're using data science they have medium posts like blogs right now data at urban is
the one for the urban institute pew research's decoded is really good and then georgetown this is the
center for security and emerging technology um i think they're probably the most advanced policy research think tank ever
um and i would encourage you to just like read their methods papers and you'll see they're doing a wide range of kind of incredible stuff this isn't intro stuff they are they are really invested um
but some of their papers will show you where the sort of cutting edge of data science foreign policy is um and then feel free to reach out i would be more than happy to yeah
thanks to my funders also a professional lesson always thank the people who give you money to survive um and uh but i really do encourage you to ask questions now i can stick around
for a few minutes or if you would rather not you are welcome to uh um send the uh questions over here yes i think uh benedict wants to ask if
um it's possible to share the slides um yep yeah yeah more than happy to share this okay and uh if there's no questions um
then i think we can end the session here um so thank you very much i will encourage questions one more time uh i'm sure normally i get more questions than
this so i'll just mention again that uh i would encourage you to just unmute yourself and ask i have helped like a couple hundred people in your position like start in this field and get jobs so it
might be might be useful just feel like i'm maybe i i don't know i talk too much possibly i have a question yes thank you so i'm um an mds student at the hertie
school and i was wondering or i'm a bit concerned about finding internships in like the current state of like really starting to learn and was
wondering if you have like yeah suggestions which positions or other institutions might be a good fit yeah for the beginning
yeah um remind me this would be for this coming summer right yes yeah so after their first year
yeah yeah um in in some ways and to totally level with you in a two-year program this is probably the hardest part in some ways i actually think getting a good internship in the summer
can be harder than getting a decent job after graduating um and i mean that because by the time you're applying you just haven't done that much yet it's just a bit of an accelerated timeline
so first it's understandable to be stressed the good news is there are very few of these programs in europe there are only a handful of people doing this
um and so you should even though you haven't done that much yet be somewhat competitive for data internships um i'm not going to be super useful on specific institutions i'm
still getting to know them but i mentioned a few like stiftung neue verantwortung and the bertelsmann stiftung um the foundations that are both
starting data science teams um and i would look at them um the thing i would do in terms of
strengthening your ability to apply um is in fact to try to do a couple even if they're fairly straightforward even if they're really early in your
sort of set of skills to do a couple um self-directed projects so you have something you can point to which is hard in your early period of
classes um and it may not be possible yet but i would say the sooner you can do even one or two things where you say i um not only did i like go to these classes
but here is my demonstrated ability to do something um that is the way to make yourself competitive um at the internship level and interns just to level with you interns are
infrequently very capable uh it's just often like a little too early and so if you can say like here is a data analysis that i did from you know from start to analysis look at it
here it is printed out even if that's just one um that's going to make you i think relatively competitive um again within a
policy area do it on policy data um with a goal in mind of specific institutions um if there is somewhere specifically you would like to work uh looking at the type of work they're doing and then working on uh
on their data to execute on something is not a not a bad idea maybe slightly overthinking it but that's not not a terrible plan right hey they're working on this data i can do analysis using that
well thank you very much yeah sorry on european and german institutions i'm a little less useful i will say as a
general idea uh twitter is also very useful um the data science and policy communities are both on twitter like they're two different communities but they're both on twitter
um and the extent to which if you find an institution that uh you might want to work for i i swear this sounds dumb but go look at their twitter
go see who works for them and what they're saying on twitter those people will have like retweeted other data science and data analyses in their same fields and you will find other organizations that way actually like
twitter is a really good tool to find your niche and find who's working in it because there's probably an established network of people retweeting one another's work so if you find one institution that's
interesting uh using twitter to find others can be super super valuable steve also has a question are there python ides that are dominant in public policy research is there a python ide
that is dominant in any field um i am slightly partial to jupyter notebooks um despite the fact that they are not the best
um coding environment for software um jupyter notebooks have a strong value in their literate programming which i didn't talk
about much but in both r markdown and jupyter notebooks you get effective use of literate programming which lets you have your code and your
text and your results all in one place um this saves you from a couple things it saves you from recopying all your graphs into a word document when your
data changes um it saves you from screwing up versions and having versioning issues sometimes so if you're in a world where you think your analysis
might be shared and you want to quickly go from code to communication i'm a big fan of jupyter notebooks and also r markdown if you're really building something that's closer to
software or a development pipeline then never mind that's not what literate programming is for but i'm a big fan of jupyter notebooks and r markdown in the policy space
i wouldn't call it dominant i would say it has a differentiated use you're welcome it's all right oh yeah another question okay cool they keep coming
also if you guys have to run i also realize it's like 6 10 on a weekday so if you're just on here to be polite please don't don't feel obligated to do that uh please ask away
uh thank you alex that was very interesting um i have one question because you also work in the field of ai regulation and ethics and have a policy kind of perspective on that and i'm also very interested in
technology policy in general and yeah on a european scale cool i'm just wondering okay now i'm studying also data science for public policy i have some coding background but i don't see exactly
from my perspective as a student how data science is applied in this kind of regulatory policy area yeah totally good question um i think i think you will find uh at some point there
will be a turning point from like no jobs to lots of jobs in the space um i would not be that worried about it um
it's definitely an emerging role um you know the the most obvious thing in europe is going to be if the dsa
the digital services act and the dma the digital markets act and the ai regulation pass the ai regulation explicitly says
you will need some amount of you know ai auditing and sort of data science regulatory capacity both in existing agencies like the ones that are doing um
product safety work uh as well as in agencies that oversee various um tools that now use ai like the ones that were overseeing the labor market or were overseeing finance
now need to have this sort of data science capacity so if the ai regulation passes absolutely it explicitly says you know that um european
countries uh member states need um ai regulatory capacity i don't think that's the only way it happens i'll give you the u.s case which is a little bit different um in the us
we're probably not going to pass a big piece of ai legislation but there is a slow emerging sense that regulation and the process by um we regulate things needs to account for algorithms we have
guidance from the food and drug administration and how they oversee medical devices that says we actually need a new process here
because the medical devices that have uh ai systems change over time how do we regulate something that's learning and changing um
maybe a kind of related example maybe we care more about um physical devices that have autonomous systems in terms of their cyber security i think it's fair to say we care more about the
cyber security of an autonomous car than a human driven car right and so there are areas where you can imagine more rigorous stress testing on ai systems as part of the regulatory
process like in those two examples as well as sort of this evolution of human services around say hiring and finance and in the us we're seeing that
incrementally agencies just adding these rules and jobs over time and using their existing authority to regulate in europe it could happen a little more all at once
but i think it's totally you know your ability to run an audit to stress test an algorithm to um simulate what might happen if an algorithm learns in a certain way over time i think those are
going to be important regulatory skills absolutely and probably a big source of jobs um for you guys at some point yeah i'm super excited about that field i think it's super interesting
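As a concrete illustration of the audit skill described above, here is a minimal sketch, with entirely invented data, of one narrow kind of algorithmic audit: comparing a model's selection rates across groups, in the spirit of the "four-fifths rule" used in US employment contexts. The function names and the toy decision log are assumptions for illustration, not a standard tool; a real audit would run against a deployed model's actual decisions.

```python
# One narrow audit check: do selection rates differ sharply by group?
# All decisions below are fabricated for illustration.
from collections import defaultdict

def selection_rates(decisions):
    """decisions: iterable of (group, selected) pairs -> rate per group."""
    totals, picks = defaultdict(int), defaultdict(int)
    for group, selected in decisions:
        totals[group] += 1
        picks[group] += int(selected)
    return {g: picks[g] / totals[g] for g in totals}

def disparate_impact(rates):
    """Ratio of the lowest to the highest group selection rate."""
    return min(rates.values()) / max(rates.values())

decisions = ([("a", True)] * 40 + [("a", False)] * 60 +
             [("b", True)] * 20 + [("b", False)] * 80)
rates = selection_rates(decisions)
print(rates)                    # {'a': 0.4, 'b': 0.2}
print(disparate_impact(rates))  # 0.5 -- below the common 0.8 threshold
```

A regulator-side version of this would add stress tests: perturbing inputs, re-running the check over time as the model retrains, and flagging when the ratio drifts below a threshold.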
Thank you so much. You're welcome. Our next participant also has some questions.

Okay, can you hear me? Yeah, excellent. I am not a data scientist, so I didn't come into this discussion from the perspective of a data scientist or a programmer. I am more interested in practical applicability; particularly, I am interested in using technology in the humanitarian sector, in deployment, and to what extent. My main question is: how do you see the opportunity cost of engaging technology, and how would you address that from a public policy perspective? Every sum not spent on the beneficiaries is money spent on something else, and the current system is that you develop a project, invest in expertise or technology hardware, and then after three years everything goes unused, to say the least. How do you see this whole approach? How would you approach it from a government perspective, and what kind of solution do you have for a long-term sustainable approach in this system?

Can you rephrase the beginning of it one more time? Are you asking about a specific area of policy? I just missed that.

Humanitarian emergency response. For example, in a refugee camp, or distributing cash transfers through digital currencies, or anything related to that. The hardware and the software come with a cost, which is an opportunity cost: that money is not being used for the beneficiaries.

Yeah, no, I think that's totally fair, saying up front that I'm not an expert in this, and this is probably not a good answer. In program evaluation, by which I mean the testing of whether or not interventions work, we spend a somewhat limited amount of money. There's a rule of thumb that you spend only five percent on evaluation, and we frequently don't even do that. With data analysis and data science specifically, you could reasonably cap how much it is sensible to spend, because you're right: none of what I'm talking about, or only in rare cases, is actually the service itself. It's important to remember that data science should be in the service of better policy, better governmental administration, better direct action, or better humanitarian efforts, as you're mentioning. I think you're right that there is absolutely a cost trade-off for a lot of this, but the amount of money you're talking about typically doesn't exceed five percent when I talk about the data analysis. Now, if you're saying that they're spending a ton of money on iPads to do data collection rather than actually giving people food or housing, that's maybe a different question and a different criticism, but not necessarily a bad one. So within the very narrow slice I know, the amount of money we're talking about, I don't want to say it's insignificant, that's not true, but it's typically in that three to five percent range, maybe. And then whether or not you're buying technology for the people on the ground in a humanitarian effort, that might be more money, but it's also a trade-off I probably can't speak to; that calls for expertise I don't have. That was not a very good answer, I apologize.

Yeah, sure. The suggestion I'm making is basically to develop the institutions you just mentioned as part of the government structures, and then donate man-hours instead of money. This way you have both the expertise and the long-term security for the data scientists, which would be difficult over a three-year project cycle. If you have a career in data science in government and then you're allocated to work, say, three months on one project and another three months on a different project, and so on, it would make much more sense for the humanitarian sector, as far as I can tell.

I know that some of the larger humanitarian groups in the US have data science teams; our FEMA does, our USAID does, the emergency response and aid organizations respectively. Some of the larger nonprofits have toyed with building data science teams that do things like try to optimize how you lay out housing in a refugee camp, or where you place water access, which I think was another big question. If you get to a large enough camp, that can turn out to be an algorithmic question. I think there is value in some of those things. That is pretty specific, though; I wonder if organizations at that scale can afford dedicated data science teams. But I'm not sure.

Yeah, thank you. You're welcome, that's a good question. Oh, I wish I knew more about that; it's also a growing area, foreign policy and data, I'm just less familiar with the humanitarian response.

Yeah, so if there are no other questions, you can always send an email to Alex, or tweet at him as well, right?
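On the water-access placement point raised above, a small sketch of why this becomes an algorithmic question at scale: choosing k sites so the farthest shelter is as close as possible to some water point is the classic k-center problem, and the greedy heuristic below is its well-known 2-approximation. The coordinates and scenario are invented for illustration; real work would use camp survey data.

```python
# Greedy k-center heuristic: repeatedly pick the shelter farthest
# from all chosen sites as the next water-point site.
# All coordinates below are fabricated for illustration.

def greedy_k_center(shelters, k):
    """Pick k site locations from shelter coordinates (greedy 2-approx)."""
    centers = [shelters[0]]  # arbitrary first site

    def dist2(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

    while len(centers) < k:
        # next site = shelter farthest from its nearest chosen site
        far = max(shelters, key=lambda s: min(dist2(s, c) for c in centers))
        centers.append(far)
    return centers

shelters = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0)]
print(greedy_k_center(shelters, 3))
```

At the scale of a few tents this is overkill; at the scale of a camp with tens of thousands of shelters, some version of this trade-off has to be computed rather than eyeballed, which is where the dedicated teams mentioned above come in.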
So yeah, I think that brings us to the end of our session. Thank you so very much, Alex, once again, for a very enlightening talk on a very interesting topic that is not often covered by industry and experts. We're very sad that you're leaving Berlin for Brussels, but I'm sure you'll be doing amazing things there as well. This is also the last event for the Data Science Lab this year; we will return in January with more news and events. We wish everyone an amazing Advent season and a very merry Christmas ahead. Thank you very much and have a fantastic day. Thank you once again, Alex, and everyone for coming.

Thanks, everybody. Appreciate everyone. Have a good one, take care.