
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu

By Databricks


Full Transcript

Hello everyone, welcome to my talk. Today I will talk about how to build a unified machine learning pipeline with XGBoost and Spark. My name is Nan Zhu. I'm a software engineer from Microsoft. In Microsoft I work on integrating Spark Streaming and Structured Streaming with Azure Event Hubs, which is a real-time messaging ingestion service on Azure, and my colleague and I will give another talk on this topic at 5:00 pm today, so welcome to that talk as well. I also work on testing, monitoring, and optimization of Spark workload performance. I serve as a committee member of Apache MXNet and DMLC, and I also

contribute to Spark. So, about DMLC, the Distributed Machine Learning Community: we are a group of engineers and researchers from different organizations, and we collaborate on open-source machine learning projects. Some of the projects we have built include XGBoost, which is the major topic today, and MXNet, which is one of the most popular deep learning frameworks.

So in today's talk I will first introduce XGBoost and XGBoost-

Spark. I will talk about the project history and give a brief introduction to how XGBoost works, but due to the time limitation I will not go into all the details and the formula derivations. We have a paper at KDD 2016 with plenty of details about the algorithm; if you're interested in that, you can read it. Then I will go into the details of why we want to integrate XGBoost and Spark, that is, what real problem we want to resolve by integrating these two systems, and I will give more details about the design of XGBoost-Spark. Finally we will have a brief discussion on what we can learn from the XGBoost-Spark design. So, this is my personal contribution to the XGBoost

project. And what is XGBoost? XGBoost is a gradient boosted tree implementation. It was created in 2014 by Tianqi Chen, who was then a PhD student at the University of Washington, and up to today it has become more and more mature. It provides bindings for different programming languages; you can run XGBoost on a single machine or on distributed machines, and you can run it with GPUs.

XGBoost-Spark is the effort to integrate XGBoost and Spark. The idea came out of a discussion at NIPS 2015 between Tianqi and me, and in March of 2016 we released the first version of XGBoost-Spark, which was based on RDDs. After the summer of the same year we had the second version, which is based on DataFrame and the ML framework. We have a growing community around XGBoost: more than half of the winning solutions in Kaggle competitions use XGBoost, and XGBoost-Spark attracts a lot of users from different companies like Airbnb, Microsoft, and Uber, and our developers are from the University of Washington, Microsoft, Uptake, and some other companies.

So then let's go technically deeper to

introduce XGBoost. As I said, XGBoost is a gradient boosted decision tree implementation. To understand this concept I will separate it into three parts: first, what is a decision tree; second, what are boosted decision trees; and third, what are gradient boosted decision trees.

For decision trees: in XGBoost we adopt one type of decision tree model called classification and regression trees (CART). Like many other decision tree models, a CART assigns the data points to the leaf nodes of a tree structure based on the rules at different levels of the tree. For example, here we work on a problem where we want to predict whether a person will like computer games. We have the first rule at the root level, which separates the data set based on the age of the person; and then at the lower level, within the younger group, maybe because a lot of games are not properly tailored to girls' interests, we can separate the data set further based on the gender of the person. What is specific to CART is that we assign a score to each leaf node of the tree, and the data points associated with that leaf node get that score and use it as the basis to answer a predictive question. Here the young boy gets plus two, which means that he most likely likes

computer games. Then, what is decision tree boosting? Decision tree boosting means that in our machine learning model we use more than one tree. For example, here we introduce a new tree which separates the data set into two parts with a new rule, based on whether the person uses a computer daily. For a single data point, the overall score it gets from this multi-tree model is the sum of the scores it gets from every tree. The idea here is called ensemble learning in the general machine learning context: we use multiple weak learners to achieve better performance than any one of them alone, and in XGBoost, in each iteration we build a new tree.
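To make the two ideas above concrete (leaf scores, and summing scores across trees), here is a toy Python sketch of the computer-games example; the rules, thresholds, and scores are invented for illustration, not values XGBoost would actually learn:

```python
# Toy ensemble of two CART-like trees for the "likes computer games"
# example; rules and scores are invented for illustration.

def tree_1(person):
    # Root rule: split on age; within the young group, split on gender.
    if person["age"] < 15:
        return 2.0 if person["gender"] == "male" else 0.1
    return -1.0

def tree_2(person):
    # A second tree with a different rule: daily computer use.
    return 0.9 if person["uses_computer_daily"] else -0.9

def predict(person, trees=(tree_1, tree_2)):
    # Boosted prediction: the sum of the scores from every tree.
    return sum(t(person) for t in trees)

boy = {"age": 10, "gender": "male", "uses_computer_daily": True}
print(predict(boy))  # 2.0 + 0.9 = 2.9
```

Boosting grows this list of trees one per iteration, each new tree refining the predictions of the current ensemble.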

The remaining question is: for each iteration, that is, for each tree, how can we determine that the shape of this tree is the optimal one we want to get? That involves the third part of our introduction to XGBoost: what is a gradient boosted tree? Before I introduce gradient boosted trees, let's review some concepts in supervised learning. Supervised learning means that we have some labeled samples from a certain data distribution, and our task is to find a function that describes this distribution based on our observations; those observations are the labeled samples. To achieve this goal we introduce an

objective function. This objective function contains two parts. The first is the training loss: it measures how well our model fits the labeled samples, that is, the training data. The second item is the regularization item: it controls the complexity of the model. Because we only get some samples from the distribution, we don't want to build a model which is so complex that it fits the labeled samples very well but ignores the unobserved samples from the same distribution.
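Concretely, the objective in the KDD 2016 paper is the training loss plus a regularization term gamma * T + (1/2) * lambda * sum(w^2), where T is the number of leaves and w are the leaf scores. A hedged Python sketch with a squared-error loss (the gamma and lambda values and the sample numbers here are arbitrary):

```python
def objective(y_true, y_pred, leaf_scores, gamma=1.0, lam=1.0):
    # Training loss: how well the model fits the labeled samples
    # (squared error here for illustration).
    loss = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    # Regularization: penalize the number of leaves (model complexity)
    # and the magnitude of the leaf scores.
    n_leaves = len(leaf_scores)
    reg = gamma * n_leaves + 0.5 * lam * sum(w * w for w in leaf_scores)
    return loss + reg

print(objective([1.0, 0.0], [0.8, 0.2], [2.0, 0.1, -1.0]))
```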

In XGBoost we also have this objective function. That means in each iteration we want to find a function which represents the optimal tree structure we want to get in this iteration. So how do we find this function? We have to answer two questions here.

First, for a given tree, how do we determine the score on each leaf node? The first step is to represent the objective function as something related to the scores on the leaf nodes. So here we define the tree as a vector of the scores of the leaves, together with a function which maps the data points to a certain leaf node; and for the regularization item we define it as the number of leaf nodes plus the sum of the squares of the scores of the leaf nodes. Then we can rewrite our objective function as something related to the scores, and after a chain of transformations we can get the optimal value of the scores and the smallest value of the objective function. I will not go into how to derive that; if you are interested, read our KDD 2016 paper.
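For reference, that chain of transformations in the KDD 2016 paper ends in a closed form: with G and H the sums of the first- and second-order gradients of the loss over the data points falling into a leaf, the optimal leaf score is w* = -G / (H + lambda). A small Python sketch (the gradient statistics are invented):

```python
def optimal_leaf_score(grads, hess, lam=1.0):
    # G and H: sums of first- and second-order gradients of the loss
    # over the training points assigned to this leaf.
    G = sum(grads)
    H = sum(hess)
    # Closed-form optimum from the XGBoost objective: w* = -G / (H + lambda).
    return -G / (H + lam)

print(optimal_leaf_score([0.5, -1.0, 0.25], [1.0, 1.0, 1.0]))  # 0.0625
```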

The second question in designing a tree structure is how to decide the best splitting point for a specific node. For example, here, why do we want to split the root node at age 15? The idea is that each node is associated with many training samples, and for each feature those samples may present continuous values. We want to avoid enumerating every value to find the best split, so we transform the continuous feature values into discrete buckets, where the buckets should present a mostly even data distribution; and among the boundaries of these buckets, we take the one which brings the largest loss reduction. How to achieve this is also described in detail in the paper.
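The loss reduction of a candidate split also has a closed form in the KDD 2016 paper: Gain = (1/2) * [G_L^2/(H_L+lambda) + G_R^2/(H_R+lambda) - (G_L+G_R)^2/(H_L+H_R+lambda)] - gamma, where the left/right statistics are gradient sums over the two children. A hedged Python sketch scoring one candidate boundary (the numbers are invented):

```python
def split_gain(GL, HL, GR, HR, lam=1.0, gamma=0.0):
    # Loss reduction from splitting a node into a left child with gradient
    # sums (GL, HL) and a right child with (GR, HR).
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

# Score one candidate bucket boundary; the split search keeps the
# boundary with the largest gain.
print(split_gain(-2.0, 3.0, 1.0, 2.0))
```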

So if you're interested, read our paper. We also have other optimizations in XGBoost: we can build a tree with multiple threads, we can scale the training with disk space, and you can also train in a distributed way. So, we have talked about so many good things about XGBoost; now let's build a machine learning pipeline in production

with XGBoost. So how do we do that? First, when we build a machine learning pipeline, we have some raw data in different formats in different places, and this raw data may be broken: it may have duplicate columns, it may have missing values, and we have to clean it. Spark is one of the best tools for doing that cleaning. Then maybe you have different teams in your organization, or different preferences, and you have to save the cleaned data in different formats in different places. Finally we can introduce XGBoost here: load the training data and build the model. Don't forget that XGBoost is just one component in your overall infrastructure; you have to make XGBoost collaborate with your supporting infrastructure, and your trained model has to be output in a certain format to serve in production. Because we want XGBoost to load different formats of data and to coordinate with our supporting infrastructure, we have to introduce a lot of glue code to bind XGBoost into the infrastructure. At NIPS

2015, there was a famous paper from a Google team. They point out that a mature system might end up with at most 5% machine learning core code: XGBoost is the machine learning core code here, and it will come with at least 95% glue code. Unfortunately, this glue code is costly in the long term, because it binds your infrastructure to a certain machine learning core code, and it becomes very hard for you to test alternatives. So how do we resolve this problem? Let's review the design of another system which we might be more

familiar with: Spark MLlib. Spark MLlib provides a framework for different machine learning algorithms. These algorithms access outside data with the standard Spark SQL API, and because the algorithms run within the framework, we don't need to introduce any additional glue code, since Spark has handled everything. Additionally, the Spark ML framework provides a set of tools for feature engineering, tuning, and a lot of other tasks. So the idea here is very clear: our goal is to make XGBoost run as a native algorithm in the Spark ML framework.

So how can we do that? First, we know that Spark is a cluster computing framework and it is centered on RDDs, so our first mission is to make XGBoost and Spark communicate at the execution layer and the memory layer. For the execution layer, let's first review how distributed training with XGBoost works. In XGBoost we have a

central control process called the tracker, and we have distributed workers. Each worker takes a partition of the training data and locally computes some local statistics; then, with an AllReduce algorithm, it synchronizes these statistics, and finally, with the synchronized statistics, it builds the tree. Our task is to run this model in a Spark cluster. The first thing to do is to start the tracker on the driver side; on the driver side we start the tracker in a process. Then on the executor side, in each Spark task, we start an XGBoost worker through the JNI interface from Scala code, and these workers communicate with each other through the original AllReduce layer in XGBoost. That is the execution layer.
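As a toy illustration of that worker pattern (pure Python, not the actual tracker or AllReduce code): each simulated worker computes local statistics on its partition, and an AllReduce-style step leaves every worker with the same global statistics:

```python
def local_stats(partition):
    # Each worker computes statistics on its own partition of the
    # training data: here, (sum of gradients, sample count).
    return (sum(partition), len(partition))

def allreduce(all_local):
    # AllReduce: every worker ends up with the element-wise sum of
    # all workers' local statistics.
    total = tuple(sum(s[i] for s in all_local) for i in range(2))
    return [total] * len(all_local)

partitions = [[0.5, -1.0], [0.25], [1.0, 2.0, -0.5]]  # 3 simulated workers
synced = allreduce([local_stats(p) for p in partitions])
print(synced[0])  # every worker now sees the same global (grad_sum, count)
```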

Then, at the memory layer: because we want XGBoost and Spark to communicate between the native memory space and the RDD memory space, we simply copy the data through the JNI interface from the RDD memory space to the native memory space, train the model there, and get the model back in the Java space. One of the implementation details here is that JNI calls are very expensive; we don't want to copy sample by sample, so we copy in batches, and then the native code interprets the memory batch and transforms it into the format which can be read by the native core code of XGBoost. All of these details are wrapped by our API: for example, users can load the data with the standard Spark SQL API, then configure XGBoost with plain Scala code, and finally, with our one-line API, train the model, get it, and do prediction.
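As a toy illustration of the batching idea (pure Python, not the actual JNI code): instead of crossing an expensive boundary once per sample, we pack a batch of rows into one flat buffer, cross once, and reinterpret it on the other side:

```python
def flatten_batch(rows):
    # Pack a batch of samples into one flat buffer, as if preparing a
    # single contiguous array to hand across an expensive JNI boundary.
    n_features = len(rows[0])
    flat = [v for row in rows for v in row]
    return flat, n_features

def unflatten(flat, n_features):
    # The "native side": reinterpret the flat buffer as rows again.
    return [flat[i:i + n_features] for i in range(0, len(flat), n_features)]

rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
flat, nf = flatten_batch(rows)
print(unflatten(flat, nf) == rows)  # one crossing carried all three samples
```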

Now we can run XGBoost within a Spark cluster. The next thing is how to integrate with the Spark ML framework, because the ML framework provides so many good things: feature engineering tools, pipeline persistence tools. So how do we integrate with this framework? Let's first review what a machine learning pipeline looks like in the Spark ML framework. First we have some raw data, and through a first set of Transformers we can pre-process that data. These Transformers can be, for example, a StringIndexer, which transforms categorical string features into numerical indices, together with a OneHotEncoder. Finally we arrive at an Estimator; this Estimator produces another Transformer, which is actually the model we want to train, and with this model we can get prediction results. This pipeline is separated into two parts, training and prediction, and in many cases we want to feed the prediction results back into the training phase to tune our models.
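As a toy sketch of the Estimator/Transformer pattern described above (pure Python; these are not the actual Spark ML classes, and MeanEstimator is a made-up trivial learner): a Transformer maps a data set to a data set, and an Estimator's fit() produces a Transformer, which is the model:

```python
class StringIndexer:
    # Transformer-like stage: maps categorical strings to numeric indices.
    def __init__(self, column):
        self.column = column

    def transform(self, rows):
        levels = {v: i for i, v in enumerate(sorted({r[self.column] for r in rows}))}
        return [{**r, self.column: levels[r[self.column]]} for r in rows]


class MeanEstimator:
    # Estimator-like stage: fit() returns a model, itself a transformer.
    def fit(self, rows):
        mean = sum(r["label"] for r in rows) / len(rows)

        class Model:
            def transform(self, rows):
                return [{**r, "prediction": mean} for r in rows]

        return Model()


data = [{"city": "b", "label": 1.0}, {"city": "a", "label": 3.0}]
indexed = StringIndexer("city").transform(data)   # pre-processing Transformer
model = MeanEstimator().fit(indexed)              # Estimator -> Model
print(model.transform(indexed)[0]["prediction"])  # 2.0
```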

So then our first task is how to make XGBoost train as a native Spark ML algorithm; here we implement an XGBoost estimator by extending the Spark ML Estimator class and triggering the distributed workers. Next, how do we make XGBoost predict as a native Spark ML model? We implement an XGBoost model by extending the Transformer framework. And finally, how do we make XGBoost tunable as a native Spark ML algorithm? We implement the XGBoost parameter system and make it accessible within the Spark ML framework. Because we implement all of these things, we can build a full machine learning pipeline with XGBoost and the Spark ML framework. Here is an

example. First we build a pre-processing pipeline; all of this is done with the Spark ML utilities and APIs. Then we search for the optimal parameters for XGBoost with the cross-validation tool, which is also from the ML framework. We evaluated the performance of XGBoost against the Spark GBT classifier: we used an airline data set, combining several years together, and we ran it on Azure; we can see XGBoost is 10 times faster than the Spark GBT classifier.

We are also improving the efficiency of XGBoost-Spark. One of the important pieces of work we are doing is a unified memory space. As we said, we currently always copy the data from the RDD space to the native memory space, and that increases the memory footprint. Now we are trying to use Apache Arrow to share the memory space between the RDD space and the XGBoost workers, so that we don't need to copy the data.

So, the last topic of today's talk is what we can learn from the

design of XGBoost-Spark. We can see that the Spark ML framework facilitates implementing something like XGBoost-Spark: we can use more efficient algorithm implementations within the Spark ML framework. But building a machine learning pipeline goes far beyond that; we have more pain points there. For example, system behavior is tightly bound to the input data, especially when your input comes from another machine learning model. We also depend on the data: maybe we want to version our training data sets to avoid, say, training a model with an immature data set. All of this could be done in the future development of the ML framework, or in some other framework on top of the current Spark ML

framework. So, in today's talk I introduced XGBoost and XGBoost-Spark, and we discussed how a machine learning algorithm is actually a very small part of the complete data processing or analytics pipeline; we showed an example of how to resolve this headache by integrating XGBoost into the ML framework, and we also had a brief discussion about our new expectations for Spark and the Spark ML framework.

Finally, let me give special thanks to Tianqi, who created the XGBoost project and offered strong support when I built XGBoost-Spark; thanks also to the XGBoost committers, contributors, and users who keep working on improving this project; and thanks to McGill University, where I got my degree, which supported me in working on this project. This is the URL of our GitHub repo and the JVM packages. Thank you very much. [Applause]

Questions? This is more like a comment:

I'm not sure if you are familiar with Deeplearning4j. It's a deep learning framework on top of Spark. They are using ND4J, which is a JVM JNI interface to off-heap memory, so maybe you can use that. I think they're just holding a reference to an off-heap memory address in their DataFrame or their RDD, and most of the data is in native memory, so I guess it could be an alternative to Apache Arrow; maybe something that you could look into. What's the question? This is just more like a comment rather than a question. Okay. You mean how to integrate with Apache Arrow? No, I mean that ND4J might be an alternative to Apache Arrow for Spark and off-heap memory interaction. Oh, okay.

Okay, let's talk offline. Hi, how do you actually use this in Spark? Is it like a dependency I just put in my POM? Or how do you use this in Spark code? How do we use it in Spark? Yeah, like how do you bring this package into Spark? Is it a separate package I can just mention in my Maven POM and then have all the things available for use? You just compile it: because our package depends on Spark, you just compile the package, and the native library is also in the package jars, and then you can run it in your cluster. So I can just add this jar to my Spark code? Yeah. Okay.

Hey, sorry, forgive me if I'm behind the times, but XGBoost on Spark

does not yet support the ranking functionality of standalone XGBoost; do you know if anyone's working on it? It was actually added just several weeks ago. Nice, right on. And one more thing, just operational-type stuff: spitting out a PMML file from Spark ML, does that still work through this? Can you spit out a PMML file from XGBoost on Spark? I'm sorry? Can you spit out a PMML file? You mentioned the glue code, right, spitting out something to productionize the code. Are you able to spit out a PMML file from the pipeline? Yeah, like a lot of Spark ML-type algorithms can spit out the model in PMML form; can I use XGBoost on Spark and spit out a PMML file from it? Yeah. Cool, how? I'll talk to you; I can meet you about how to do that. Yeah. So your goal is to use XGBoost with the... okay, let's discuss

offline. I have a question: can I run this in Java Spark, or does it only support Scala Spark? Currently it only supports Scala; potentially you can call it from Java, but there is no Python. Python? No, we only have Scala bindings right now. Okay, cool, thanks.

Hi, so there are lots of parameters that you can set for XGBoost; I

was wondering how you do hyperparameter search. Generally we have some guidance on tuning XGBoost: there are a few most important parameters, and in most cases you just need to tune those; in some corner cases you may want to involve other things. So with this, I think three or five are the most important ones; just set those with the ML API and let cross-validation help you do that. Another question: the cross-validation is not for Spark, right? Is cross-validation working with Spark? Because I think it's for the non-Spark JVM packages, as far as I know. You're asking about the non-Spark one? No: can you use cross-validation with the Spark package of XGBoost? No, no, sorry. And another question: have you tried to do cross-validation with Spark for some long-running jobs? Because I have tried to do that and I've seen some memory leaks for the Spark version; are there any memory leaks? So far, when I tried many times, I didn't see that. Regarding memory, the biggest issue in the current implementation is that we need to copy data, and that means a larger memory footprint; so that's the big problem, but memory leaks I haven't found.

Hi, last year when some of this work came out, a number of people

tried using it and found very significant slowdowns as the number of nodes goes up. Oh, the training speed slows down? The training speed slows down substantially; sublinear growth. Has the cause of that been discovered? Have there been changes recently to make that go away? There is some cost because of the AllReduce operation on each iteration: when you increase the nodes, or the computing resources, you have more requirement to synchronize among all of them, so that becomes the biggest cost, because each worker has fewer training samples to train on and the computation is no longer the bottleneck. In that case the scalability is not that good; that's kind of expected, but of course we can improve our AllReduce operation.

So is your advice to try to train with as large nodes as possible, and as few of them, in the Spark cluster scenario? As a default, you mean? If we want to fully utilize it, is it just stuck at a certain size? I'm looking for your advice on how to configure clusters for optimal training performance. Okay: based on your training data set size, and based on your feature size, I mean the dimension of your feature vectors, we don't have any off-the-shelf guidance for tuning it; you have to experiment and see what the bottleneck is. In most cases, when you have less data, you will not want to use a lot of workers, because, as I said, communication is a bottleneck. And the other advice is that you can explore assigning more CPUs to each Spark task if you are stuck on the computing part: by assigning more CPUs, combined with OpenMP, it will utilize multiple threads to build each tree, and that will accelerate your overall training. So that's for when you're stuck on computing, and when you're stuck on communication... Okay, I think there are more questions, but let's take it offline and thank the speaker again. Okay, thank

you
