Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
By Databricks
Full Transcript
Hello everyone, welcome to my talk. Today I will talk about how to build a unified machine learning pipeline with XGBoost and Spark. My name is Nan Zhu. I'm a software engineer at Microsoft, where I work on integrating Spark Streaming and Structured Streaming with Azure Event Hubs, a real-time message ingestion service on Azure. My colleague and I will give another talk on that topic at 5:00 pm today, so welcome to that talk as well. I also work on testing, monitoring, and performance optimization for Spark workloads. I serve as a committer of Apache MXNet and a member of DMLC, and I also contribute to Spark.

About DMLC, the Distributed Machine Learning Community: we are a group of engineers and researchers from different organizations who collaborate on open source machine learning projects. Some of the projects we build include XGBoost, which is the major topic today, and
MXNet, one of the most popular deep learning frameworks.

In today's talk I will first introduce XGBoost and XGBoost-Spark. I'll cover the project history and give a brief introduction to how XGBoost works, but due to the time limit I will not go into the algorithmic details and formula derivations. We have a KDD 2016 paper with plenty of detail about the algorithm; if you're interested, you can read it. Then I will go into why we want to integrate XGBoost and Spark, that is, what real problem we want to solve by integrating these two systems, and I will give more details about the design of XGBoost-Spark. Finally we will have a brief discussion on what we can learn from the XGBoost-Spark design. This is my personal contribution to the XGBoost
project.

So what is XGBoost? XGBoost is a gradient boosted tree implementation. It was created in 2014 by Tianqi Chen, then a PhD student at the University of Washington, and it has grown more and more mature since. It provides bindings for different programming languages; you can run XGBoost on a single machine or on distributed machines, and you can run it with GPUs.

XGBoost-Spark is the effort to integrate XGBoost and Spark. The idea came out of a discussion between Tianqi and me at NIPS 2015. In March 2016 we released the first version of XGBoost-Spark, which was based on RDDs, and in the summer of the same year we released the second version, which is based on DataFrames and the ML framework. We have a growing community around XGBoost: more than half of the winning solutions in Kaggle competitions use XGBoost, and XGBoost-Spark has attracted users from companies like Airbnb, Microsoft, and Uber, with developers from the University of Washington, Microsoft, Uptake, and other companies.

So then let's go technically deeper to
introduce XGBoost. As I said, XGBoost is a gradient boosted decision tree implementation. To understand this concept, I will break it into three parts: first, what is a decision tree; second, what are boosted decision trees; and third, what is a gradient boosted decision tree.

For the decision tree part: in XGBoost we adopt one type of decision tree model called classification and regression trees (CART). Like many other decision tree models, a CART assigns data points to the leaf nodes of a tree structure based on rules at the different levels of the tree. For example, suppose we want to predict whether a person likes computer games. We have the first rule at the root level, which separates the data set based on the age of the person. Then at the lower level, within the younger group, maybe because a lot of games are not properly tailored to girls' interests, we can separate the data set further based on the gender of the person. What is specific to CART is that we assign a score to each leaf node of the tree, and the data points associated with a leaf node get that score and use it as the basis for answering the predictive question. Here the youngest boy gets +2, meaning he most likely likes
computer games.

Then what is decision tree boosting? Decision tree boosting means that our machine learning model uses more than one tree. For example, here we introduce a second tree which separates the data set into two parts with a new rule based on whether the person uses a computer daily. For a single data point, the overall score it gets from this multi-tree model is the sum of the scores it gets from every tree. The idea here is called ensemble learning in the general machine learning context: we use multiple weak learners to achieve better performance than any one of them alone. In XGBoost, each iteration builds a new tree. The remaining question is, for each iteration, that is, each tree, how can we determine that the shape of this tree is the optimal one? That brings us to the
third part of our introduction to XGBoost: what is a gradient boosted tree? Before introducing it, let's review some concepts in supervised learning. Supervised learning means we have some labeled samples from a certain data distribution, and our task is to find a function describing this distribution based on our observations, those labeled samples. To achieve this goal we introduce an objective function. The objective function contains two parts. The first is the training loss, which measures how well our model fits the labeled samples, that is, the training data. The second term is the regularization term, which controls the complexity of the model: because we only observe some samples from the distribution, we don't want to build a model that is so complex that it fits the labeled samples very well but ignores the unobserved samples from the same distribution. In XGBoost we also
have this objective function, which means that in each iteration we want to find a function representing the optimal tree structure for that iteration. To find this function we have to answer two questions.

First, specific to a tree, how do we determine the scores on the leaf nodes? The idea is to first rewrite the objective function in terms of the leaf scores. We define the tree as a vector of leaf scores together with a function that maps each data point to a certain leaf node, and we define the regularization term as the number of leaves plus the sum of the squares of the leaf scores. Then we can transform the objective function into something expressed in the leaf scores, and after a chain of transformations we can get the optimal value of the scores and the smallest value of the objective function.
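The result of that chain of transformations can be stated compactly. Using the standard notation of the KDD 2016 paper (a summary of the result, not the derivation): with $g_i$ and $h_i$ the first and second derivatives of the loss at the current prediction, $I_j$ the set of data points mapped to leaf $j$, $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$, the objective, the optimal leaf scores, and the resulting objective value are:

```latex
\text{obj} = \sum_i l(y_i, \hat{y}_i) + \sum_k \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2

w_j^{\ast} = -\frac{G_j}{H_j + \lambda},
\qquad
\text{obj}^{\ast} = -\frac{1}{2}\sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T
```

Here $T$ is the number of leaves and $w_j$ is the score on leaf $j$.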
I won't go into how to derive that here; if you are interested, read our KDD 2016 paper.

The second question in designing the tree structure is how to decide the best splitting point for a specific node. For example, here, why do we split the root node at age 15? Since each node is associated with many training samples, and for each feature those samples may take many continuous values, we want to avoid enumerating every value to find the best split. So we transform the continuous feature values into discrete buckets, where the buckets roughly follow an even data distribution, and among the bucket boundaries we take the one that brings the largest loss reduction. How to achieve this is also described in detail in the paper.
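To make the bucket-and-gain idea concrete, here is a toy sketch. This is not XGBoost's actual implementation: the data layout and function names are mine, and the equal-count bucketing is a simplification of the paper's weighted quantile sketch. The gain of a candidate split follows the standard formula $\tfrac{1}{2}\big[G_L^2/(H_L+\lambda) + G_R^2/(H_R+\lambda) - (G_L+G_R)^2/(H_L+H_R+\lambda)\big] - \gamma$:

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Loss reduction from splitting one node into left/right children."""
    def score(g, h):
        return g * g / (h + lam)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right)) - gamma

def best_split(points, n_buckets=4):
    """points: list of (feature_value, grad, hess), pre-sorted by value.

    Candidate thresholds are bucket boundaries chosen so each bucket holds
    roughly the same number of points, instead of trying every value.
    """
    n = len(points)
    boundaries = [points[i * n // n_buckets][0] for i in range(1, n_buckets)]
    g_total = sum(g for _, g, _ in points)
    h_total = sum(h for _, _, h in points)
    best = (float("-inf"), None)
    for t in boundaries:
        # Left child: points strictly below the threshold.
        g_l = sum(g for v, g, _ in points if v < t)
        h_l = sum(h for v, _, h in points if v < t)
        gain = split_gain(g_l, h_l, g_total - g_l, h_total - h_l)
        best = max(best, (gain, t))
    return best  # (gain, threshold)
```

For a feature where all negative gradients sit below some value and all positive gradients above it, the search picks the boundary between the two groups, since that split gives the largest loss reduction.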
So if you're interested, please read the paper. We also have other optimizations in XGBoost: we can build a tree with multiple threads, we can scale training using disk space, and you can train in a distributed way.

We have talked about so many good things about XGBoost, so let's build a machine learning pipeline in production
with XGBoost. How do we do that? When we build a machine learning pipeline, we start with raw data in different formats and in different places. This raw data may be broken: it may have duplicated columns and missing values, and we have to clean it. Spark is one of the best tools for that cleaning. Then, maybe because you have different teams in your organization, or different preferences, you have to save the cleaned data in different formats in different places. Finally we can introduce XGBoost here: load the training data and build the model. But don't forget that XGBoost is just one component in your overall infrastructure; you have to make XGBoost collaborate with your supporting infrastructure, and you have to output your trained model in a certain format to serve it in production. Because we want XGBoost to load different formats of data and coordinate with the supporting infrastructure, we have to introduce a lot of glue code to bind XGBoost into the infrastructure. At NIPS 2015 there was a famous paper from a Google team pointing out that a mature system might end up with at most 5% machine learning core code. XGBoost is that machine learning core code, and it will be surrounded by at least 95% glue code. Unfortunately, this glue code is costly in the long term, because it binds your infrastructure to a particular piece of machine learning core code,
and it becomes very hard for you to test alternatives. How do we resolve this problem? Let's review the design of another system we may be more familiar with: Spark MLlib. Spark MLlib provides a framework for different machine learning algorithms. These algorithms access outside data through the standard Spark SQL API, and because they run within the framework, we don't need to introduce any additional glue code; Spark has handled everything. Additionally, the Spark ML framework provides a set of tools for feature engineering, tuning, and many other tasks. So the idea here is very clear: our goal is to make XGBoost run as
a native algorithm in the Spark ML framework.

How can we do that? First, we know that Spark is a cluster computing framework centered on RDDs, so our first mission is to make XGBoost and Spark communicate at the execution layer and the memory layer. For the execution layer, let's first review how distributed training with XGBoost works. In XGBoost we have a central control process called the tracker, and we have distributed workers. Each worker takes a partition of the training data and computes some local state, then synchronizes this state with the other workers through an AllReduce algorithm, and finally builds the tree with the synchronized state. Our task is to run this model in
a Spark cluster. The first thing to do is to start the tracker on the driver side: on the driver we start the tracker as a process. Then on the executor side, in each Spark task, we start an XGBoost worker through the JNI interface from Scala code, and these workers communicate with each other through XGBoost's original AllReduce layer. That's the execution layer. Then,
at the memory layer, we need XGBoost and Spark to communicate between the native memory space and the RDD memory space. We simply copy the data through the JNI interface from the RDD memory space to the native memory space, train the model, and get the model back in the Java space. One implementation detail here is that JNI calls are very expensive, so we don't want to copy sample by sample; instead we copy in batches, and the native code then interprets the memory batch and transforms it into the format readable by XGBoost's native core code.

All of these details are wrapped by our API. For example, users can load the data with the standard Spark SQL API, configure XGBoost with plain Scala code, and then, with our one-line training API, train the model, get it, and do prediction.
Now we can run XGBoost within a Spark cluster. The next thing is how to integrate with the Spark ML framework, because the ML framework provides so many good things: feature engineering tools, pipelines, persistence tools. To integrate with this framework, let's first review what a machine learning pipeline looks like in the Spark ML framework. First we have some raw data, and through a first set of transformers we pre-process it. These transformers can be, for example, a StringIndexer, which transforms categorical string features into numerical indices, or a OneHotEncoder. Finally we arrive at an estimator. The estimator produces another transformer, which is actually the model we want to train, and with this model we can get prediction results. This pipeline is separated into two parts: first training, and second prediction.
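The estimator/transformer pattern just described can be illustrated with a tiny, self-contained sketch. These are plain-Python stand-ins whose class names mirror spark.ml; none of this is the real spark.ml API:

```python
class StringIndexer:
    """Transformer: maps categorical strings to numeric indices."""
    def __init__(self, column):
        self.column = column

    def transform(self, rows):
        index = {}
        out = []
        for row in rows:
            row = dict(row)  # copy; transformers don't mutate input
            value = row[self.column]
            row[self.column] = index.setdefault(value, len(index))
            out.append(row)
        return out

class MeanEstimator:
    """Estimator: fit() on labeled rows produces a Model (a Transformer)."""
    def fit(self, rows):
        mean = sum(r["label"] for r in rows) / len(rows)
        return MeanModel(mean)

class MeanModel:
    """Model: itself a transformer that adds a prediction column."""
    def __init__(self, mean):
        self.mean = mean

    def transform(self, rows):
        return [dict(r, prediction=self.mean) for r in rows]

class Pipeline:
    """Transformers followed by one estimator; fit() returns the model."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        for stage in self.stages[:-1]:
            rows = stage.transform(rows)
        return self.stages[-1].fit(rows)
```

The key structural point, mirrored from spark.ml: fitting a pipeline returns a model that is itself a transformer, so training and prediction share one abstraction.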
And in many cases we want to feed the prediction results back into the training phase to tune our models. So our first task is to make XGBoost train as a native Spark ML algorithm: we implement an XGBoost estimator by extending the Spark ML Estimator class, and it triggers the distributed training workers. Next, to make XGBoost predict as a native Spark ML model, we implement an XGBoost model by extending the Transformer abstraction. Finally, to make XGBoost tunable as a native Spark ML algorithm, we implement XGBoost's parameter system and make it accessible through the Spark ML framework. Because we implemented all these
things, we can build a full machine learning pipeline with XGBoost and the Spark ML framework. Here is an example. First we build the pre-processing pipeline; all of this is done with Spark ML utilities and APIs. Then we search for the optimal parameters for XGBoost with the cross-validation tool, also from the ML framework.

We evaluated the performance of XGBoost against Spark's GBT classifier, using an airline data set with several years of data combined, running on Azure. We see that XGBoost is about ten times faster than the Spark GBT classifier. We are also improving the efficiency of XGBoost-Spark. One important piece of work in progress is a unified memory space: as we said, we currently copy the data from the RDD space to the native memory space, which increases the memory footprint. We are now trying to use Apache Arrow to share the memory space between the RDD space and the XGBoost workers, so that we don't need to copy the data.

The last topic of today's talk is what we can learn from the
design of XGBoost-Spark. We can see that the Spark ML framework facilitates implementing something like XGBoost-Spark: we can use a more efficient algorithm implementation within the Spark ML framework. But building a machine learning pipeline goes far beyond that, and we have more pain points there. For example, system behavior is tightly bound to the input data, especially when your input comes from another machine learning model. We also depend on the data in other ways; for example, we may want to version our training data sets, to avoid things like training a model on an immature data set. All of this could be addressed in future development of the ML framework, or in some other framework on top of the current Spark ML
framework.

So in today's talk I introduced XGBoost and XGBoost-Spark. We discussed how the machine learning algorithm is actually a very small part of the complete data processing and analytics pipeline, and we showed an example of how to resolve this headache by integrating XGBoost into the ML framework. We also had a brief discussion of our new expectations for Spark and the Spark ML framework.

Finally, let me give special thanks to Tianqi, who created the XGBoost project and offered strong support when I built XGBoost-Spark; thanks to the XGBoost committers, contributors, and users who keep working to improve this project; and thanks to McGill University, where I got my degree, which supported me working on this project. This is the URL of our GitHub repo and the jvm-packages directory. Thank you
very much.

[Applause]

Questions? — This is more like a comment. I'm not sure if you are familiar with Deeplearning4j; it's a deep learning framework on top of Spark. They use ND4J, which is a JVM JNI interface to off-heap memory, so maybe you could use that. I think they just hold a reference to an off-heap memory address in their DataFrame or their RDD, and most of the data is in native memory. So I guess it could be an alternative to Apache Arrow; maybe something you could look into. — What's the question? — This is more of a comment than a question. — Okay. You mean how to integrate with Apache Arrow? — No, I mean ND4J might be an alternative to Apache Arrow for Spark and off-heap memory interaction. — Oh, okay,
let's talk offline. — Hi, how do you actually use this in Spark? Is it a dependency I just put in my POM? How do you bring this package into Spark? Is it a separate package I can just mention in my Maven POM and then have everything available for use? — You just compile the package: because our package depends on Spark, you compile it, and the native library is also in the package jars, and then you can run it in your cluster. — So I can just add this jar to my Spark code? — Yeah. — Okay.

Hey, sorry, forgive me if I'm behind the times, but XGBoost on Spark
does not yet support the ranking functionality of standalone XGBoost. Do you know if anyone's working on that? — It was supported just a few weeks ago. — Nice, right on. One more thing, operational-type stuff: spitting out a PMML file from Spark ML, does that still work through this? Can you spit out a PMML file from XGBoost on Spark? You mentioned that part of the glue code is spitting out something to productionize the code. A lot of Spark ML algorithms can spit out the model in PMML form; can I use XGBoost on Spark and spit out a PMML file from it? — Yeah. — Cool, how? — I'll talk to you; I can show you how to do that. So your goal is to use XGBoost with PMML? Okay, let's discuss
that offline.

— I have a question: can I run this in Java Spark, or does it only support Scala Spark? — Currently it only supports Scala; potentially you can call it from Java. — And no Python? — No, we only have Scala bindings right now. — Okay, cool, thanks.

— Hi, there are lots of parameters you can set for XGBoost. I
was wondering how you do hyperparameter search. — Generally we have some guidance on tuning XGBoost. There are a few parameters that are the most important; in most cases you just need to tune those, and in some corner cases you may want to involve others. I think three or five are the most important; just set those with the ML API and let cross-validation help you do it. — Another question: does cross-validation work with Spark? Because I think it's for the non-Spark JVM packages, as far as I know. — You mean, can you use cross-validation with the Spark package of XGBoost? — Yes. — No, sorry, not yet. — And another question: have you tried doing cross-validation with Spark for some long-running jobs? Because I have tried that and I've seen some memory leaks in the Spark version. Are there any memory leaks? — So far, having tried many times, I haven't seen that. Regarding memory, the biggest issue in the current implementation is that we need to copy data, which gives a larger memory footprint; that's the big memory problem. As for memory leaks, I haven't found any.

— Hi, last year when some of this work came out, a number of people
tried using it and found very significant slowdowns as the number of nodes goes up. — Oh, the training speed slows down? — The training speed slows down substantially; the scaling is sublinear. Has the cause been discovered? Have there been changes recently to make that go away? — There is some cost from the AllReduce operation on each iteration. When you increase the number of nodes or the computing resources, you have more synchronization among all of them, and that becomes the biggest cost, because each worker then has fewer training samples and the computation is no longer the bottleneck. So in that case the scalability is not that good. It's something we expected, but of course we can improve the AllReduce operation. — So is your advice, in the Spark cluster scenario, to train with nodes as large as possible and as few of them as possible? — As a default, you mean, to fully utilize it? — I'm looking for your advice on how to configure clusters for optimal training performance. — It's based on your training data set size and your feature size, I mean the dimension of your feature vectors. We don't have any off-the-shelf guidance for tuning it; you have to experiment and see what the bottleneck is. In most cases, when you have less data, you will not want to use a lot of workers, because, as I said, communication is the bottleneck. The other advice is that you can explore assigning more CPUs to each Spark task if you are stuck on the computing part: by assigning more CPUs, combined with OpenMP, it will use multiple threads to build each tree, and that will accelerate your overall training. That's for when you are stuck on computing; the earlier advice is for when you are stuck on communication. — Okay, I think there are more questions, but let's take it offline and thank the speaker again. — Okay, thank you.