Lecture 2: RPC and Threads
By MIT 6.824: Distributed Systems
Summary
## Key takeaways
- **Go's Convenience: Threads, RPC, and Safety**: Go simplifies distributed programming with built-in support for threads, locking, and convenient remote procedure calls. It also offers type safety and memory safety, eliminating common bugs found in languages like C++. [00:51], [01:22]
- **Threads for Concurrency and Parallelism**: Threads (or goroutines in Go) are essential for managing concurrency in distributed systems, enabling a program to handle multiple tasks simultaneously, such as waiting for network responses or performing computations across multiple CPU cores. [03:33], [08:00]
- **Race Conditions: The Danger of Shared Memory**: When multiple threads access shared memory without proper synchronization, race conditions can occur, leading to unpredictable behavior. For example, incrementing a shared variable can result in an incorrect final value if not protected. [21:15], [22:02]
- **Mitigating Races with Locks and Synchronization**: Locks (mutexes in Go) are used to protect shared data, ensuring that only one thread can access it at a time. This prevents race conditions, making code that accesses shared state safer, though it requires careful management by the programmer. [26:13], [43:47]
- **Channels: Communicating Without Shared Memory**: Go's channels provide an alternative to shared memory for thread coordination. They allow threads to communicate by sending and receiving data, eliminating the need for locks and reducing the risk of race conditions. [33:34], [43:08]
- **WaitGroups for Coordinating Goroutines**: WaitGroups are a synchronization primitive in Go that allow a program to wait for a specific number of goroutines to complete their tasks. This is crucial for ensuring that all concurrent operations have finished before proceeding. [34:06], [51:27]
Topics Covered
- Go's thread safety: Garbage collection and memory safety are key.
- Threads simplify complex asynchronous operations.
- Race conditions are insidious bugs that require deliberate tools to detect.
- Beware of infinite goroutine creation; use worker pools for bounded concurrency.
- Go's concurrency model avoids shared memory issues through channels.
Full Transcript
today I'd like to talk about Go which
is interesting especially interesting
for us in this course because of course Go
is the language that you're all
going to do the labs in and so I want to
focus today particularly on some of the
machinery that's sort of most useful in
the labs and most particular to
distributed programming um first of all
you know it's worth asking why we use go
in this class in fact we could have used
any one of a number of other system
style languages there are plenty of languages like
Java or C sharp or even Python that
provide the kind of facilities we need
and indeed we used to use C++ in this
class and it worked out fine but Go
indeed like many other languages
provides a bunch of features which are
particularly convenient
that is good support for threads and
locking and synchronization between
threads which we use a lot there is a
convenient remote procedure call package
which doesn't sound like much but it
actually turns out to be a significant
convenience in languages like C++
for example it's actually a bit hard to
find a convenient easy to use remote
procedure call package and of course we
use it all the time in this course for
programs on different machines to talk
to each other unlike C++ Go is type safe
and memory safe that is it's pretty hard
to write a program that due to a bug
scribbles over some random piece of
memory and then causes the program to do
mysterious things and that just
eliminates a big class of bugs similarly
it's garbage collected which means you're
never in danger of freeing the same memory
twice or freeing memory that's still in use
or something the garbage collector just
frees things when they stop being used
and one thing that's maybe not obvious
until you've played around with this
kind of programming before is that the
combination of threads and garbage
collection is particularly important one
of the things that goes wrong in a non
garbage collected language like C++ if
you use threads is that it's always a
bit of a puzzle and requires a bunch of
bookkeeping to figure out when the last
thread
that's using a shared object has
finished using that object because only
then can you free the object and you end
up writing quite a bit of code like
the programmer has to write a
bunch of code to manually you know do
reference counting or something in order
to figure out you know when the last
thread stopped using an object and
that's just a pain and that problem
completely goes away if you use garbage
collection like we have in Go
and finally the language is simple much
simpler than C++ one of the problems
with using C++ is that often if you made
an error you know maybe even just a typo
the the error message you get back from
the compiler is so complicated that in
C++ it's usually not worth trying to
figure out what the error message meant
and I find it's always just much quicker
to go look at the line number and try to
guess what the error must have been
because the language is far too
complicated
whereas go you know probably doesn't
have a lot of people's favorite features
but it's a relatively straightforward
language okay so at this point you've
all done the tutorial if you're looking
for sort of you know what to look at
next to learn about the language a good
place to look is the document titled
effective go which you know you can find
by searching the web all right the first
thing I want to talk about is threads
the reason why we care a lot about
threads in this course is that threads
are the sort of main tool we're going to
be using to manage concurrency in
programs and concurrency is a particular
interest in distributed programming
because it's often the case that one
program actually needs to talk to a
bunch of other computers you know client
may talk to many servers or a server may
be serving requests at the same time on
behalf of many different clients and so
we need a way to say oh you know my
program really has seven different
things going on because it's talking to
seven different clients and I want a
simple way to allow it to do these seven
different things you know without too
much complex programming I mean sort of
threads are the answer so these
are the things that the go documentation
calls goroutines which I'll call threads
goroutines are really the same
as what everybody else calls
threads so the way to think of threads is
that you have a program of one program
and one address space I'm gonna draw a
box to sort of denote an address space
and within that address space in a
serial program without threads you just
have one thread of execution executing
code in that address space one program
counter one set of registers one stack
that are sort of describing the current
state of the execution in a threaded
program like a go program you could have
multiple threads and you know I'll draw
them as multiple squiggly lines and what
each line really represents is a
separate especially if the
threads are executing at the same time
a separate program counter a
separate set of registers and a separate
stack for each of the threads so that
they can have a sort of their own thread
of control and be executing each thread
in a different part of the program and
so hidden here is that for every
thread there's a
stack that it's executing on the stacks
are actually in the one address space
of the program so even though each
thread has its own stack
technically they're all in the same
address space and different threads
could refer to each other's stacks if they
knew the right addresses although you
typically don't do that and in go
even the main program you know when
you first start up the program and it
runs in main that's also just a go
routine and can do all the things that
goroutines can do all right so as I
mentioned one of the big reasons is to
allow different parts of the program to
sort of each be at its own point in a
different activity so I usually refer to
that as IO concurrency for historical
reasons and the reason I call it IO
concurrency is that in the old days
where this first came up is that oh you
might have one thread that's waiting to
read
from the disk and while it's waiting to
read from the disk you'd like to have a
second thread that maybe can compute or
read somewhere else on the disk or send
a message on the network and wait for
a reply and so I/O concurrency is one of
the things that threads buy you for us
I/O concurrency
will usually mean I can have one program
that has launched remote procedure
call requests to different servers on
the network and is waiting for many
replies at the same time that's how
it'll come up for us and you know the
way you would do that with threads is
that you would create one thread for
each of the remote procedure calls that
you wanted to launch that thread would
have code that you know sent the remote
procedure call request message and sort
of waited at this point in the thread
and then finally when the reply came
back the thread would continue executing
and using threads allows us to have
multiple threads that all launch
requests into the network at the same
time they all wait or they don't have to
do it at the same time they can you know
execute the different parts of this
whenever they feel like it
so that's I/O concurrency sort of
overlapping of the progress of different
activities so that while one activity is
waiting other activities can proceed
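That pattern of launching one goroutine per outstanding request and collecting the replies can be sketched like this (a hypothetical `fetchAll` helper; the `time.Sleep` stands in for a real network call):

```go
package main

import (
	"fmt"
	"time"
)

// fetchAll sends one simulated RPC per server, each from its own
// goroutine, and waits for all the replies: many requests are
// outstanding at the same time, which is the I/O concurrency idea.
func fetchAll(servers []string) []string {
	replies := make(chan string)
	for _, s := range servers {
		go func(s string) {
			time.Sleep(10 * time.Millisecond) // pretend network delay
			replies <- "reply from " + s      // stand-in for a real RPC reply
		}(s)
	}
	var out []string
	for range servers {
		out = append(out, <-replies) // replies arrive in any order
	}
	return out
}

func main() {
	fmt.Println(fetchAll([]string{"s1", "s2", "s3"}))
}
```

Note that each goroutine gets the loop variable `s` as a parameter, so every request captures its own server name.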
another big reason to use threads is
multi-core parallelism which I'll just
call parallelism and here the thing
we'd be trying to achieve with
threads is if you have a multi-core
machine like I'm sure all of you do in
your laptops if you have a sort of
compute heavy job that needs a lot of
CPU cycles wouldn't it be nice if you
could have one program that could use
CPU cycles on all of the cores of the
machine and indeed if you write a
multi-threaded go program if you launch
multiple goroutines and they do something
compute intensive like sit there in a
loop and you know compute digits of pi
or something then up to the limit of the
number of cores in the physical machine
your threads will run truly in parallel
and if you launch you know two threads
instead of one you'll be able
to use twice as many CPU
cycles per second so this is very
important to some people it's not a big
deal in this course
it's rare that we'll sort of think
specifically about this kind of
parallelism in the real world though when
building things like servers to form
parts of distributed systems it can
sometimes be extremely important to be
able to have the server be able to run
threads and harness the CPU power of a
lot of cores just because the load from
clients can often be pretty high okay so
parallelism is a second reason why
threads are of quite a bit of interest in
distributed systems and a third reason
which is maybe a little bit less
important is that there are times
when you really just want to be able to
do something in the background or you
know there's just something you need to
do periodically and you don't want to
have to sort of in the main part of your
program sort of insert checks to say
well should I be doing this things that
should happen every second or so you
just like to be able to fire something
up that every second does whatever the
periodic thing is so there's some
convenience reasons and an example which
will come up for you is it's often the
case that you know a master server
may want to check periodically whether
its workers are still alive and if one
of them has died you know you want to
relaunch that work on another machine like
MapReduce might do that and one way to
arrange sort of oh do this check every
second every minute you know send a
message to the worker are you alive is
to fire off a go routine that just sits
in a loop that sleeps for a second and
then does the periodic thing and then
sleeps for a second again and so in the
labs you'll end up firing off these kind
of threads quite a bit yes the question is
is the overhead worth it yes the overhead is
really pretty small for this stuff I
mean you know it depends on how many if you
create a million threads that each sit in
a loop waiting for a millisecond and
then send a network message that's
probably a huge load on your machine but
if you create you know ten threads that
sleep for a second and do a little bit
of work it's probably not a big deal at
all and
I guarantee you the programmer time you
save by not having to sort of mush
together the different
activities into one piece of code
is worth the small amount of CPU cost
almost always still you know if
you're unlucky you'll discover in the
labs that some loop of yours is not
sleeping long enough or you fired
off a bunch of these and never made them
exit for example and they just
accumulate so you can push it too far
okay so these are the
main reasons that people like threads a
lot and that we'll use threads in this
class any other questions about threads
in general by asynchronous programming you
mean like a single thread of control
that keeps state about many different
activities yeah so this is a good
question actually you know what
would happen if we didn't have threads
or for some reason we didn't want
to use threads like how would we be able
to write a program you know a
server that could talk to many different
clients at the same time or a client
that could talk to many servers right
what what tools could we use and it
turns out there is sort of another
major style
of how you structure these programs
called asynchronous programming or I
might call it event-driven programming
so the general
structure of an event-driven program is
usually that it has a single thread and
a single loop and what that loop does is
sits there and waits for any input or
sort of any event that might trigger
processing so an event might be the
arrival of a request from a client or a
timer going off or if you're building a
window system in fact many window
systems on your laptops are
written in an event-driven style where what
they're waiting for is like key clicks
or mouse movements
or something so in an event-driven
program you might have a
single thread of control that sits in a loop and
waits for input and whenever it gets an
input like a packet it figures out oh
you know which client did this packet
come from and then it'll have a table of
sort of what the state is of whatever
activity it's managing for that client
and it'll say oh gosh I was in the
middle of reading such-and-such a file
you know now it's asked me to read the
next block I'll go read the next block
and return it and threads are generally
more convenient because they allow you
to really you know it's much easier to
write sequential just like straight
lines of control code that does you know
computes sends a message waits for
response whatever it's much easier to
write that kind of code in a thread than
it is to chop up whatever the activity
is into a bunch of little pieces that
can sort of be activated one at a time
by one of these event-driven loops that
said one problem with
this scheme is that it's a little
bit of a pain to program another
potential defect is that while you get
io concurrency from this approach you
don't get CPU parallelism so if you're
writing a busy server that would really
like to keep you know 32 cores busy on a
big server machine you know a single
loop is not a very
natural way to harness more than one
core on the other hand the overheads of
event-driven programming are generally
quite a bit less than threads you know
threads are pretty cheap but each one of
these threads is sitting on a stack you
know a stack is a kilobyte or a few kilobytes
or something you know if you have 20 of
these threads who cares if you have a
million of these threads then it's
starting to be a huge amount of memory
and you know maybe the scheduling
bookkeeping for deciding which thread
to run next might also start to add up you
know you now have scheduling lists with
a thousand threads in them so threads
can start to get quite expensive so if
you are in a position where you need to
have a single server that serves
you know a million clients and has to
sort of keep a little bit of state for
each of a million clients this could be
expensive
and at some expense in programmer time it's
easier to write a really stripped-down
efficient low overhead service with
event-driven programming it's just a lot more
work are you asking about JavaScript I
don't know the question is whether
JavaScript has multiple cores executing
does anybody know depends on the
implementation yeah so I don't know I
mean it's a natural thought though even
in go you might well
want to have if you knew your machine
had eight cores if you wanted to write
the world's most efficient whatever
server you could fire up eight threads
and on each of the threads run a sort of
stripped-down event-driven loop just you
know sort of one event loop per core and
that you know that would be a way to get
both parallelism and the I/O
concurrency yes
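A minimal sketch of such an event-driven loop, assuming the events arrive on a channel rather than from the network (hypothetical `event` and `eventLoop` names):

```go
package main

import "fmt"

// One thread, one loop, a table of per-client state, and a select
// that waits for whichever event arrives next. A real server's
// events would be network packets rather than values on a channel.
type event struct {
	client string
	data   int
}

// eventLoop processes events one at a time until quit is closed,
// then returns its per-client state table.
func eventLoop(events <-chan event, quit <-chan struct{}) map[string]int {
	state := make(map[string]int) // state of each client's activity
	for {
		select {
		case e := <-events:
			state[e.client] += e.data // resume that client's activity
		case <-quit:
			return state
		}
	}
}

func main() {
	events := make(chan event)
	quit := make(chan struct{})
	done := make(chan map[string]int)
	go func() { done <- eventLoop(events, quit) }()
	events <- event{"c1", 1} // unbuffered sends: each is received
	events <- event{"c1", 2} // by the loop before the next goes through
	events <- event{"c2", 5}
	close(quit)
	state := <-done
	fmt.Println(state["c1"], state["c2"]) // 3 5
}
```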
okay so the question is what's the
difference between threads and processes
so usually on like a UNIX machine a
process is a single program that you're
running and a sort of single address
space a single bunch of memory for the
process and inside a process you might
have multiple threads and when you write
a go program and you run it
running the go program creates one unix
process and one sort of memory area and
then when your go program creates go
routines those are all sitting inside
that one process so I'm not sure that's
really an answer but just historically
the operating systems have provided
this big box the process that's
implemented by the operating system
and the operating
system does not care what happens inside
your process what language you use is none
of the operating system's business but
inside that process you can run lots of
threads now you know if you run more
than one process on your machine you
know you run more than one program like an
editor and a compiler the operating system
keeps them quite separate right your
editor and your compiler each have
memory but it's not the same memory and they're
not allowed to look at each other's
memory there's not much interaction
between different processes so your
editor may have threads and your
compiler may have threads but they're
just in different worlds so within any
one program the threads can share memory
and can synchronize with channels and
use mutexes and stuff but between
processes there's just no no interaction
that's just a traditional structure of
these this kind of software
yeah
so the question is when a context switch
happens does it happen for all threads
okay so let's imagine you have a
single core machine that's really only
doing one thing at a
time maybe the right way to think about
it is that you're
running multiple processes on your
machine the operating system will give
the CPU sort of time slicing back and
forth between these programs so when
the hardware timer ticks and the
operating system decides it's time to
take away the CPU from the currently
running process and give it to another
process that's done at a process level
it's complicated all right let me
restart this the threads
that we use are based on threads that
are provided by the operating system in
the end and when the OS does
context switches it's switching between
the threads that it knows about so in a
situation like this the operating system
might know that there are two threads
here in this process and three threads
in this process and when the timer ticks
the operating system will based on some
scheduling algorithm pick a different
thread to run it might be a different
thread in this process or one of the
threads in this process
in addition go cleverly multiplexes
many goroutines on top of single
operating system threads to reduce
overhead so there's really probably two
stages of scheduling the operating
system picks which OS thread to run and
then within that process go may have a
choice of goroutines to run
all right okay so threads are convenient
because a lot of times they allow you to
write the code for each thread just as
if it were a pretty ordinary sequential
program however there are in fact some
challenges with writing threaded code
one is what to do about shared data one
of the really cool things about the
threading model is that these threads
share the same address space they share
memory if one thread creates an object
in memory you can let other threads use
it right you can have an array or
something that all the different threads
are reading and writing and that's
sometimes critical right you know
if you're keeping some interesting state
you know maybe you have a cache of
things that your server caches in
memory when a thread is handling a
client request it's gonna first look in
that shared cache and each
thread reads it and the threads may
write the cache to update it when they
have new information to stick in the
cache so it's really cool you can share
that memory but it turns out that it's
very very easy to get bugs if you're not
careful and you're sharing memory
between threads so a totally classic
example is you know supposing you
have a global variable n
that's shared among the different
threads and a thread just wants to
increment n right by itself this is
likely to be an invitation to bugs right
if you don't do anything special around
this code and the reason is that you
know whenever you write code in a thread
that you you know is accessing reading
or writing data that's shared with other
threads you know there's always the
possibility and you got to keep in mind
that some other thread may be looking at
the data or modifying the data at the
same time so the obvious problem with
this is that maybe thread 1 is executing
this code and thread 2 is
executing the very same code right and
remember I'm imagining that n is a global
variable so they're talking about the
same n so what this boils down to you
know you're not actually running this
code you're running
machine code the compiler produced and
what that machine code does is it you
know it loads n into a register
you know adds one to the register and
then stores that register back into n
where n is the address of some location
in RAM so you know it can happen that
both of the threads
are both executing this line of code
you know they both load the variable n
into a register if n starts out at 0
that means they both load 0
they both increment that register so
they get one and they both store one
back to memory and now two threads have
incremented n and the resulting value is
1 which well who knows what the
programmer intended maybe that's what
the programmer wanted but chances are
not right chances are the programmer
wanted 2 not 1 some instructions
are atomic so the question a very
good question is whether
individual instructions are atomic and
the answer is some are and some aren't
so a 32-bit store is
extremely likely to be atomic in the
sense that if 2 processors store at the
same time to the same memory address
32-bit values what you'll end up with is
either the 32 bits from one processor or
the 32 bits from the other processor but
not a mixture other sizes it's not so
clear like one byte stores it depends on
the CPU you're using because a one byte
store is really almost certainly a 32-bit
load and then a modification of 8
bits and a 32-bit store but it depends
on the processor and more complicated
instructions like increment your
microprocessor may well have an
increment instruction that can directly
increment some memory location that's
pretty unlikely to be atomic although
there's atomic versions of some of these
instructions
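The lost-update scenario described above can be reproduced directly; this is a sketch, not lab code, and running it with `go run -race` makes Go's race detector flag the bug:

```go
package main

import (
	"fmt"
	"sync"
)

// racyCount has two goroutines each increment a shared n 1000 times
// with no locking. Because n = n + 1 is really load / add / store,
// increments can be lost and the result is often less than 2000.
func racyCount() int {
	n := 0
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				n = n + 1 // unsynchronized read-modify-write: a race
			}
		}()
	}
	wg.Wait()
	return n
}

func main() {
	// usually prints something at most 2000, and not always the same number
	fmt.Println(racyCount())
}
```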
so there's no way all right so this is
this is a just classic danger and it's
usually called a race I'm gonna come up
a lot is you're gonna do a lot of
threaded programming with shared state
race I think refers to as some ancient
class of bugs involving electronic
circuits but for us that you know the
reason why it's called a race is because
if one of the CPUs have started
executing this code and the other one
the others thread is sort of getting
close to this code it's sort of a race
as to whether the first processor can
finish and get to the store before the
second processor start status execute
the load if the first processor actually
manages it to do the store before the
second processor gets to the load then
the second processor will see the stored
value and the second processor will load
one and add one to it in store two
that's how you can justify this
terminology okay and so the way you
solve this at least something this
simple is you insert locks
you know you as a programmer you have
some strategy in mind for locking the
data you can say well you know this
piece of shared data can only be used
when such-and-such a lock is held and
you'll see this and you may have used
this in the tutorial the go calls locks
mutexes so what you'll see is a mule Ock
before a sequence of code that uses
shared data and you unlock afterwards
and then whichever two threads execute
this when it to everyone is lucky enough
to get the lock first gets to do all
this stuff and finish before the other
one is allowed to proceed and so you can
think of wrapping some code in a lock
as you know remember
even though it's one line it's
really three distinct operations you can
think of a lock as causing this sort of
multi-step code sequence to be atomic
with respect to other people who hold
the lock yes
can you repeat the
question
oh that's a great question the question
was how does go know which variable
we're locking right here of course there's
only one variable but maybe we're saying
n equals x plus y really that's a few
different variables and the answer is
that go has no idea there's no
association at all
anywhere between this lock so this mu
thing is a variable whose type is
Mutex there's just no
association in the language between the
lock and any variables the association is
in the programmer's head so as a
programmer you need to say oh here's a
bunch of shared data and any time you
modify any of it you know here's a
complex data structure say a tree or an
expandable hash table or something
anytime you're going to modify it and of
course a tree is composed of many many
objects anytime you're going to modify
anything that's associated with this
data structure you have to hold such and
such a lock right and of course there's many
objects and the set of objects changes
because you might allocate new tree
nodes but it's really the programmer who
sort of works out a strategy for
ensuring that the data structure is used
by only one thread at a time and so
creates the one or maybe more locks and
there's many many locking strategies you
could apply to a tree you can imagine a
tree with a lock for every tree node the
programmer works out the strategy
allocates the locks and keeps in the
programmer's head the relationship to the
data but for go this
lock is just a very simple thing
there's a lock object the first thread
that calls lock gets the lock other
threads have to wait until it unlocks
and that's all go knows
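A sketch of the same shared counter with the increments wrapped in a mutex; as described above, the convention that `mu` protects `n` exists only in the programmer's head, not in the language:

```go
package main

import (
	"fmt"
	"sync"
)

// safeCount is the two-goroutine counter again, but by our own
// convention every access to n happens while holding mu. Go does
// not know that mu protects n; we do.
func safeCount() int {
	n := 0
	var mu sync.Mutex
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				mu.Lock()
				n = n + 1 // load/add/store is now atomic w.r.t. other holders of mu
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return n
}

func main() {
	fmt.Println(safeCount()) // 2000
}
```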
yeah
does it not lock all variables that are
part of the object go doesn't know
anything about the relationship between
variables and locks so when you acquire
that lock when you have code that calls
lock exactly what it is doing is
acquiring this lock and that's all it
does so somewhere else we would have
declared you know a mutex mu all right
and this mu refers to some particular
lock object and there may be many many
locks right all this does is acquire
this lock and anybody else who wants to
acquire it has to wait until we unlock
this lock it's totally up to us as
programmers what we are protecting with
that lock so the question is is it
better to have the lock be the
private business of the data structure
like supposing it's a map yeah and
you know you would hope although it's
not true that map internally would have
a lock protecting it and a
reasonable strategy would be if you
define a data structure that needs to be
locked to have the lock be sort of
interior to have each of the data
structure's methods be responsible for
acquiring that lock and the user of the
data structure may never know
that's pretty reasonable and the only
point at which that breaks down is that
um well it's a couple things one is if
the programmer knew that the data was
never shared they might be bummed that
they were paying the lock overhead for
something they knew didn't need to be
locked so that's one potential problem
the other is that if there's any
inter data structure dependency so
we have two data structures each with
locks
and they maybe use each other then
there's a risk of cycles and deadlocks
right and the deadlocks can be solved
but the usual solutions to deadlocks
require lifting the locks out of
the implementations up into the calling
code I will talk about that at some point
it's often a good idea to hide
the locks but it's not always a good
idea all right okay so one problem you
run into with threads is these races and
generally you solve them with locks okay
or actually there's two big strategies
one is you figure out some locking
strategy for making access to the data
one thread at a time or
you fix your code to not share data
if you can do that that's probably
better because it's less complex all
right so another issue that shows up
with threads is called
coordination when we're doing locking
the different threads involved probably
have no idea that the other ones exist
they just want to like be able to get
at the data without anybody else
interfering but there are also cases
where you do
intentionally want different threads to
interact I want to wait for you
maybe you're producing some data you
know you're a different thread than me
you're producing data and I'm gonna
wait until you've generated the data
before I read it right or you launch a
bunch of threads to say crawl the
web and you want to wait for all those
fetches to finish so there's times when we
intentionally want different threads to
interact with each other to wait for
each other
and that's usually called coordination
and there's a bunch of as you probably
know from having done the tutorial
there's a bunch of techniques in go for
doing this like channels
which are really about sending data from
one thread to another and waiting for
the data to be sent there's also other
more special purpose things
like an idea called condition
variables which is great if there's some
thread out there and you want to give it a kick
you're not sure if the other
thread is even waiting for you but if it
is waiting for you you'd just like to give
it a kick so it knows that it
should continue whatever it's doing and
then there's WaitGroup which is
particularly good for launching a
known number of goroutines and then
waiting for them all to finish and a
final piece of damage that comes up with
threads deadlock the deadlock refers to
the general problem that you sometimes
run into where one thread
you know thread this thread is waiting
for thread two to produce something so
you know it's draw an arrow to say
thread one is waiting for thread two you
know for example thread one may be
waiting for thread two to release a lock
or to send something on the channel or
to you know decrement something in a
wait group however unfortunately maybe T
two is waiting for thread thread one to
do something and this is particularly
common in the case of locks its thread
one acquires lock a and thread to
acquire lock be so thread one is
acquired lock a throw two is required
lot B and then next thread one needs to
lock B also that is hold two locks which
sometimes shows up and it just so
happens that thread two needs to hold
block hey that's a deadlock all right at
least grab their first lock and then
proceed down to where they need their
second lock and now they're waiting for
each other forever right neither can
proceed neither then can release the
lock and usually just nothing happens so
if your program just kind of grinds to a
halt and doesn't seem to be doing
anything but didn't crash deadlock is
it's one thing to check
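To make the lock-ordering point concrete, here is a small runnable sketch (not from the lecture's crawler.go): every goroutine acquires the two locks in the same global order, A before B, which is the standard way to avoid the cycle just described. If some goroutines instead took B before A, two of them could each hold their first lock and wait forever on the second.

```go
package main

import (
	"fmt"
	"sync"
)

var (
	muA sync.Mutex
	muB sync.Mutex
)

// run launches 100 goroutines that each need both locks. Because every
// goroutine acquires muA before muB, no acquire-wait cycle can form.
// If half of them locked muB first, each side could grab its first
// lock and block forever on the second: the classic deadlock.
func run() int {
	var wg sync.WaitGroup
	n := 0
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			muA.Lock() // consistent global order: A, then B
			muB.Lock()
			n++ // critical section guarded by both locks
			muB.Unlock()
			muA.Unlock()
		}()
	}
	wg.Wait()
	return n
}

func main() {
	fmt.Println(run())
}
```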
Okay, all right, let's look at the web crawler from the tutorial as an example of some of this threading stuff. I have three solutions in different styles, to let us talk a bit about the details of some of this thread programming. First of all, you all probably know what a web crawler is. Its job is: you give it the URL of a page that it starts at, and since many web pages have links to other pages, what a web crawler is trying to do is, from that first page, extract all the URLs mentioned in that page's links, fetch the pages they point to, look at all those pages for all the URLs that they refer to, and keep on going until it's fetched all the pages in the web, let's just say, and then it should stop.
In addition, the graph of pages and URLs is cyclic. That is, if you're not careful, if you don't remember "oh, I've already fetched this web page", you may end up following cycles forever, and your crawler will never finish. So one of the jobs of the crawler is to remember the set of pages that it has already crawled, or has even already started a fetch for, and to not start a second fetch for any page that it's already started fetching. You can think of that as imposing a tree structure: finding a sort of tree-shaped subset of the cyclic graph of actual web pages. Okay, so we want to avoid cycles; we want to not fetch a page twice.
It also turns out that it just takes a long time to fetch a web page, both because servers are slow and because the network has a long speed-of-light latency, so you definitely don't want to fetch pages one at a time, unless you want the crawl to take many years. So it pays enormously to fetch many pages at the same time, up to some limit: you want to keep increasing the number of pages you fetch in parallel until the throughput you're getting in pages per second stops increasing. That is, you increase the concurrency until you run out of network capacity. So we want to be able to launch multiple fetches in parallel.
And a final challenge, which is sometimes the hardest thing to solve, is to know when the crawl is finished. Once we've crawled all the pages, we want to stop and say we're done, but we actually need to write the code to realize: aha, we've crawled every single page. For some solutions I've tried, figuring out when you're done has turned out to be the hardest part. All right, so my first
crawler is this serial crawler here. By the way, this code is available on the website, under crawler.go on the schedule, if you want to look at it. This first one's called the serial crawler, and it effectively performs a depth-first search of the web graph. There's one moderately interesting thing about it: it keeps this map called fetched, which it's basically using as a set, in order to remember which pages it's crawled, and that's about the only interesting part of it. You give it a URL, and at line 18, if it's already fetched that URL, it just returns. If it hasn't fetched the URL, it first remembers that the URL is now fetched, then actually fetches that page and extracts the URLs that are in the page with the fetcher, and then iterates over the URLs in that page and calls itself for every one of them.
It really has just one table; there's only one fetched map, of course, because when I call the recursive crawl and it fetches a bunch of pages, after it returns, the outer crawl instance needs to be aware that certain pages are already fetched. So we depend very much on the fetched map being passed between the functions by reference instead of by copying. Under the hood, what must really be going on is that Go is passing a pointer to the map object to each of the calls to crawl, so they all share a pointer to the same object in memory, rather than copying it. Any questions?
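The shape of that serial crawler can be sketched like this; note the pages map below is a made-up stand-in for the lecture's Fetcher, just to keep the sketch self-contained and runnable:

```go
package main

import "fmt"

// Stand-in for the fetcher: which URLs each page links to.
// Note "b" links back to "a", so the graph is cyclic.
var pages = map[string][]string{
	"a": {"b", "c"},
	"b": {"a"},
	"c": {"d"},
	"d": {},
}

// Serial does a depth-first crawl. The fetched map acts as a set of
// URLs already started; because Go maps behave like pointers, every
// recursive call sees and updates the same underlying table.
func Serial(url string, fetched map[string]bool) {
	if fetched[url] {
		return // already crawled: avoids cycles and duplicate fetches
	}
	fetched[url] = true
	for _, u := range pages[url] { // the "fetch" plus link extraction
		Serial(u, fetched)
	}
}

func main() {
	fetched := map[string]bool{}
	Serial("a", fetched)
	fmt.Println(len(fetched)) // all four reachable pages, each once
}
```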
This code definitely does not solve the problem that was posed, because it doesn't launch parallel fetches. So clearly we need to insert goroutines somewhere in this code to get parallel fetches. Let's suppose, just for chuckles, that we start with the laziest thing: I'm going to modify the code to run the subsidiary crawls each in its own goroutine. Actually, before I do that, why don't I run the code, just to show you what correct output looks like. So, hopping to this other window, I run the crawler; it actually runs all three versions of the crawler, and they all find exactly the same set of web pages. This is the output we're hoping to see: five lines, five different web pages fetched, and it prints a line for each one.
So let me now run the subsidiary crawls in their own goroutines, and run that code. What am I going to see? The hope is to fetch these web pages in parallel, for higher performance. Okay, so you're voting for only seeing one URL. And why is that? Yes, that's exactly right. It's not going to wait in this loop at line 26; it's going to zip right through that loop. It fetches the very first web page at line 22, and then in the loop it fires off the goroutines, and immediately the crawl function returns. Since it was called from main, main will exit, almost certainly before any of the goroutines was able to do any work at all. So we'll probably just see the first web page. And when I run it, you'll see here under "serial" that only the one web page was found. In fact, since this program doesn't exit after the serial crawler, those goroutines are still running, and they actually print their output down here, interleaved with the next crawler's output. But nevertheless, just adding a "go" here absolutely doesn't work, so let's get rid of that. Okay, so now I want to
get rid of that okay so now I want to
show you a one style of concurrent
crawler and I'm presenting to one of
them written with shared data shared
objects and locks it's the first one and
another one written without shared data
but with passing information along
channels in order to coordinate the
different threads so this is the shared
data one or this is just one of many
ways of building a web crawler using
shared data so this code significantly
more complicated than a serial crawler
it creates a thread for each fetch it
does alright but the huge difference is
that it does with two things one it does
the bookkeeping required to notice when
all of the crawls have finished and it
handles the shared table of which URLs
have been crawled correctly so this code
still has this table of URLs and that's
this F dot fetched this F dot fetch
map at line 43 but this this table is
actually shared by all of the all of the
crawler threads and all the collar
threads are making or executing inside
concurrent mutex and so we still have
this sort of tree up in current mutexes
that's exploring different parts of the
web graph but each one of them was
launched as a as his own go routine
instead of as a function call but
they're all sharing this table of state
this table of test URLs because if one
go routine fetches a URL we don't want
another girl routine to accidentally
fetch the same URL and as you can see
here line 42 and 45 I've surrounded them
by the new taxes that are required to to
prevent a race that would occur if I
didn't add them new Texas so the danger
here is that at line 43 a thread is
checking of URLs already been fetched so
two threads happen to be following the
same URL now two calls to concurrent
mutex end up looking at the same URL
maybe because that URL was mentioned in
two different web pages if we didn't
have the lock they'd both access the
math table to see if the threaded and
then already if the URL had been already
fetched and they both get false at line
43 they both set the URLs entering the
table to true at line 44 and at 47 they
will both see that I already was false
and then they both go on to patch the
web page so we need the lock there and
the way to think about it I think is
that we want lines 43 and 44 to be
atomic that is we don't want some other
thread to to get in and be using the
table between 43 and 44 we we want to
read the current content each thread
wants to read the current table contents
and update it without any other thread
interfering and so that's what the locks
are doing for us okay so so actually any
questions about the about the locking
strategy here
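Here is a simplified, self-contained sketch of this locking strategy; the pages map stands in for the real fetcher, and the names loosely mirror the ones used in lecture:

```go
package main

import (
	"fmt"
	"sync"
)

// Stand-in for the fetcher: which URLs each page links to.
var pages = map[string][]string{
	"a": {"b", "c"},
	"b": {"a"},
	"c": {"d"},
	"d": {},
}

// fetchState groups the lock with the table it protects. The grouping
// is only for convenience; Go does not associate the mutex with the
// map automatically.
type fetchState struct {
	mu      sync.Mutex
	fetched map[string]bool
}

// ConcurrentMutex does an atomic test-and-set on the shared table.
// Without the lock, two goroutines could both see "not fetched" and
// both go fetch the same page.
func ConcurrentMutex(url string, fs *fetchState) {
	fs.mu.Lock()
	already := fs.fetched[url]
	fs.fetched[url] = true
	fs.mu.Unlock()
	if already {
		return
	}
	urls := pages[url] // the "fetch"
	var wg sync.WaitGroup
	for _, u := range urls {
		wg.Add(1)
		go func(u string) { // u passed by value: each goroutine's own copy
			defer wg.Done()
			ConcurrentMutex(u, fs)
		}(u)
	}
	wg.Wait() // wait for this call's children before returning
}

func runMutexCrawl() int {
	fs := &fetchState{fetched: map[string]bool{}}
	ConcurrentMutex("a", fs)
	return len(fs.fetched)
}

func main() {
	fmt.Println(runMutexCrawl())
}
```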
All right, once we've checked the URL's entry in the table, at line 51 it just fetches that page in the usual way, and then the other interesting thing that's going on is the launching of the threads. Yes? So the question is, what's with the f.mu? Okay, so there's a structure defined at line 36 that just collects together all the different state we need to run this crawl. Here it's only two objects, but it could be a lot more, and they're only grouped together for convenience. There's no deep significance to the fact that mu and fetched are stored inside the same structure, and f-dot is just the syntax for getting at one of the elements of the structure. I just happened to put the mu in the structure because it lets me group together all the stuff related to a crawl, but that absolutely does not mean that Go associates the mu with that structure, or with the fetched map, or anything. It's just a lock object, it just has a Lock function you can call, and that's all that's going on.
So the question is, how come in order to pass something by reference I had to use star here, whereas in the previous example, when we were passing a map, we didn't have to use star, that is, didn't have to pass a pointer? That star notation you're seeing there in line 41 is basically saying that we're passing a pointer to this fetchState object, and we want it to be a pointer because we want there to be one object in memory, and all the different goroutines want to use that same object, so they all need a pointer to that same object. If you define your own structure, that's the syntax you use for passing a pointer. The reason we didn't have to do it with the map is that, although it's not clear from the syntax, a map is a pointer. It's just that, because it's built into the language, they don't make you put a star there. What a map really is, if you declare a variable of type map, is a pointer to some data in the heap. So it was a pointer anyway, and it's always passed by reference; you just don't have to put the star, it does it for you.
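A tiny illustration of the point: a map value already behaves like a pointer when passed to a function, while a struct needs an explicit pointer to be shared. (The names here are made up for the example.)

```go
package main

import "fmt"

type state struct{ n int }

// mark mutates the map it is given; because a map variable is
// effectively a pointer to heap data, the caller sees the change.
func mark(m map[string]bool, k string) {
	m[k] = true
}

// bump takes an explicit pointer; with a plain struct argument the
// function would get a copy and the caller would see nothing.
func bump(s *state) {
	s.n++
}

func main() {
	m := map[string]bool{}
	mark(m, "x")
	s := state{}
	bump(&s)
	fmt.Println(m["x"], s.n) // both updates are visible to the caller
}
```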
So maps are definitely special. You cannot define map within the language; it has to be built in, because there are some curious things about it. Okay, good. So we fetch the
page, and now we want to fire off a crawl goroutine for each URL mentioned in the page we just fetched. That's done starting at line 56, which loops over the URLs that the fetch function returned, and for each one fires off a goroutine at line 58. That func syntax on line 58 is a closure, a sort of immediate function. What that func keyword is doing is declaring a function right there, which we then call. The way to read it, maybe, is that you can declare a function as a piece of data: just func, then you give the arguments, then you give the body, and that's a closure. It's an object now. It's like when you type 1 or 23 or something, you're declaring a sort of constant object, and this is the way to define a constant function. And we do it here because we want to launch a goroutine that's going to run this function we declared right here. In order to make the goroutine, we have to add a go in front to say we want a goroutine, and then we have to call the function, because the syntax of the go keyword is that you follow it by a function name and the arguments you want to pass to that function. So we're going to pass some arguments here.
And there's really one reason we're doing this: in some other circumstance we could have just said "go ConcurrentMutex", since ConcurrentMutex is the name of the function we actually want to call with this URL, but we want to do a few other things as well. So we define this little helper function that first calls ConcurrentMutex for us with the URL, and then, after ConcurrentMutex is finished, does something special in order to help us wait for all the crawls to be done before the outer function returns.
So that brings us to the WaitGroup. The WaitGroup at line 55 is just a data structure defined by Go to help with coordination, and the game with WaitGroup is that internally it has a counter. You call WaitGroup.Add, like at line 57, to increment the counter, and WaitGroup.Done to decrement it, and then this Wait method, called at line 63, waits for the counter to get down to zero. So a WaitGroup is a way to wait for a specific number of things to finish, and it's useful in a bunch of different situations. Here we're using it to wait for the last goroutine to finish, because we add one to the WaitGroup for every goroutine we create; line 60, at the end of the little function we declared, decrements the counter in the WaitGroup; and then line 63 waits until all the decrements have finished. So the reason we declared this little function was basically to be able to both call ConcurrentMutex and call Done; that's really why we needed that function. So the question is: what if one
function so the question is what if one
of the subroutines fails and doesn't
reach the done line that's a darn good
question there is you know if I forget
the exact range of errors that will
cause the go routine to fail without
causing the program to feel maybe
divides by zero I don't know where
dereference is a nil pointer
not sure but there are certainly ways
for a function to fail and I have the go
routine die without having the program
die and that would be a problem for us
and so really the white right way to I'm
sure you had this in mind and asking the
question the right way to write this to
be sure that the done call is made no
matter why this guru team is finishing
would be to put a defer here which means
call done before the surrounding
function finishes and always call it no
matter why the surrounding function is
finished yes
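Here is a sketch of that defer pattern; the early return below stands in for a goroutine body that bails out for any reason:

```go
package main

import (
	"fmt"
	"sync"
)

// worker simulates a goroutine body that can return early. With
// defer, Done runs no matter which path the function takes, so the
// Wait below cannot hang waiting on a counter that never reaches zero.
func worker(i int, wg *sync.WaitGroup, mu *sync.Mutex, n *int) {
	defer wg.Done() // always decrement, even on early return
	if i%2 == 0 {
		return // early exit still triggers the deferred Done
	}
	mu.Lock()
	*n++
	mu.Unlock()
}

func runAll(k int) int {
	var wg sync.WaitGroup
	var mu sync.Mutex
	n := 0
	for i := 0; i < k; i++ {
		wg.Add(1) // count up before launching each goroutine
		go worker(i, &wg, &mu, &n)
	}
	wg.Wait() // block until the counter is back to zero
	return n
}

func main() {
	fmt.Println(runAll(10)) // only the 5 odd i's increment n
}
```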
Yes, so the question is, how come two uses of Done in different threads aren't a race? The answer must be that internally a WaitGroup has a mutex, or something like it, that each of its methods acquires before doing anything else, so that simultaneous calls to a WaitGroup's methods are safe. Is there something like this in lower-level languages? Yeah, certainly. For C++ and C you want to look at something called pthreads. For C, threads come in a library, they're not really part of the language; it's called pthreads, and these are extremely traditional, ancient primitives that show up in many languages. Say it again? You know, not in this code, but you could imagine other uses of WaitGroups; WaitGroups just count stuff, and a WaitGroup doesn't really care what you're counting or why, though this is the most common way to see one used. You're wondering why u is passed as a parameter to the function at line 58? Okay.
All right, so the question is... actually, backing up a little. The rule for a function like the one I'm defining on line 58 is that if the function body mentions a variable that's declared in the outer function, and not shadowed, then the inner function's use of that name is the same variable as in the outer function. And that's what's happening with fetcher, for example. What does the fetcher variable refer to in the inner function? It's the same variable as the fetcher in the outer function; it just is that variable. So when the inner function refers to fetcher, it's just referring to the same variable as this one here, and the same with f: where f is used here, it just is this variable.
So you might think that we could get rid of this u argument we're passing, have the inner function take no arguments at all, and just use the u that was defined on line 56 in the loop. It would be nice if we could do that, because it would save us some typing. It turns out not to work, and the reason is that the semantics of Go's for loop at line 56 are that the loop updates the variable u. In the first iteration of the for loop, the variable u contains some URL, and when you enter the second iteration, that same variable's contents are changed to be the second URL. That means that the first goroutine we launched, if it were looking at the outer function's u variable, would see a different value in u after the outer function updated it. And sometimes that's actually what you want: for example, for f, and in particular f.fetched, the inner function absolutely wants to see changes to that map. But for u we don't want to see changes; the first goroutine we spawn should read the first URL, not the second URL. So we want that goroutine to have its own private copy of the URL. We could have done it in other ways, but the way this code happens to produce a copy private to the inner function is by passing the URL as an argument. Yes?
Yeah, if we had passed the address of u... well, I don't know exactly how strings are represented, but passing u by value absolutely does give you your own private copy of the variable. Are you saying we don't need to play this trick in the code? We definitely need to play this trick in the code, and what's going on is this. So the question is: strings are immutable. Strings are immutable, right, so how, given that strings are immutable, can the outer function change the string? There should be no problem. The problem is not that the string is changed; the problem is that the variable u is changed. When the inner function mentions a variable that's defined in the outer function, it's referring to that variable and the variable's current value. So if you have a string variable that has "a" in it, and then you assign "b" to that string variable, you're not overwriting the string; you're changing the variable to point to a different string. And because the for loop changes the u variable to point to a different string, that change to u would be visible inside the inner function, and therefore the inner function needs its own copy of the variable. Essentially, we make a copy of it, and that is what we're doing in this code, and that is why this code works.
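A minimal sketch of the safe pattern: passing the loop variable as an argument so each goroutine gets its own copy. (A note for readers on newer toolchains: Go 1.22 changed for loops to create a fresh variable per iteration, but passing the value explicitly, as the lecture's code does, works on every version.)

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// collect launches one goroutine per URL. Passing u as an argument
// gives each goroutine its own copy of the string, so it is immune to
// the loop reusing the variable for the next iteration.
func collect(urls []string) []string {
	var mu sync.Mutex
	var wg sync.WaitGroup
	var got []string
	for _, u := range urls {
		wg.Add(1)
		go func(u string) { // u here is a private copy for this goroutine
			defer wg.Done()
			mu.Lock()
			got = append(got, u)
			mu.Unlock()
		}(u)
	}
	wg.Wait()
	sort.Strings(got) // goroutines finish in any order; sort for a stable result
	return got
}

func main() {
	fmt.Println(collect([]string{"a", "b", "c"}))
}
```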
As for the proposal, the broken code that we're not using here, I will show you the broken code. This is just a horrible detail, but it is unfortunately one that you'll run into while doing the labs, so you should at least be aware that there's a problem, and when you run into it, maybe you can try to figure out the details. Okay.
That's a great question. So the question is, just to repeat it: if you have an inner function that refers to a variable in the surrounding function, but the surrounding function returns, what is the inner function's variable referring to, now that the outer function has returned? The answer is that Go notices. Go analyzes your inner functions, these are called closures; the compiler analyzes them and says, aha, this closure, this inner function, is using a variable in the outer function, and the compiler will allocate heap memory to hold the current value of the variable, and both functions will refer to that little area of heap that holds the variable. So the variable won't be allocated on the stack, as you might expect; it's moved to the heap if the compiler sees that it's used by a closure. Then, when the outer function returns, the object is still there in the heap, and the inner function can still get at it. And the garbage collector is responsible for noticing when the last function that refers to this little piece of heap has exited, and freeing it only then. Okay.
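Here is the classic tiny example of a closure outliving its enclosing function; the counter's n is moved to the heap exactly as described:

```go
package main

import "fmt"

// counter returns a closure that refers to the local variable n.
// Because the closure outlives the call, the compiler moves n to the
// heap; the garbage collector frees it once nothing refers to it.
func counter() func() int {
	n := 0
	return func() int {
		n++
		return n
	}
}

func main() {
	c := counter() // counter has returned, but its n lives on
	c()
	c()
	fmt.Println(c()) // the closure kept n alive across calls
}
```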
Okay, so WaitGroup. The WaitGroup is maybe the more important thing here: the technique this code uses to wait for all of this level's crawls, all its direct children, to finish is the WaitGroup. Of course there are many of these WaitGroups, one per call to ConcurrentMutex; each call to ConcurrentMutex just waits for its own children to finish and then returns.
Okay, so back to the lock. Actually, there's one more thing I want to talk about with the lock, and that is to explore what would happen if we hadn't locked. I'm claiming that if you don't lock, you're going to get these races, you're going to get incorrect execution, whatever. Let's give it a shot: I'm going to comment out the locks, and the question is, what happens if I run the code with no locks? What am I going to see? We may see a URL crawled twice, or fetched twice. Yeah, that would be the error you might expect. All right, so I'll run it without locks, and we're looking at the concurrent one, in the middle. This time it doesn't seem to have fetched anything twice; it's only five. Run again... gosh, so far, genius. So maybe we're wasting our time with those locks; it never seems to go wrong. I've actually never seen it go wrong. But the code is nevertheless wrong, and someday it will fail.
The problem is that this is only a couple of instructions here, and so the chances of these two threads, which each run maybe hundreds of instructions, happening to stumble on the same couple of instructions at the same time is quite low. And this is a real bummer about buggy code with races: it usually works just fine, but it probably won't work when the customer runs it on their computer. So that's actually bad news for us. In complex programs it can be quite difficult to figure out whether you have a race, and you may have code that looks completely reasonable but is in fact, unknown to you, using shared variables. The answer is that really the only way to find races in practice is to use automated tools, and luckily Go gives us a pretty good race detector, built into Go, and you should use it. If you pass the -race flag when you run or build your Go program, it runs with the race detector. Well, I'll run the race detector and we'll see.
So it emits an error message for us: it's found a race, and it actually tells us exactly where the race happened. There's a lot of junk in this output, but the really critical thing is that the race detector realized that we had read a variable, that's what this "read" is, that was previously written, with no intervening release and acquire of a lock; that's what this means. Furthermore, it tells us the line numbers: it's told us that the read was at line 43 and the previous write was at line 44, and indeed, if we look at the code, the read is at line 43 and the write is at line 44. So that means one thread did a write at line 44, and then, without any intervening lock, another thread came along and read that written data at line 43. That's basically what the race detector is looking for.
The way it works internally is that it allocates sort of shadow memory. It uses a huge amount of memory: for every one of your memory locations, the race detector allocates a little bit of memory of its own, in which it keeps track of which threads recently read or wrote every single memory location. It also keeps track of when threads acquire and release locks and do other synchronization activities that it knows force threads to not run concurrently. And if the race detector ever sees, aha, there was a memory location that was written and then read with no intervening unlock and lock, it'll raise an error.
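A small experiment you can try with the detector: the counter below is protected by a mutex and always comes out right. If you delete the Lock/Unlock pair and run with `go run -race`, the detector reports the unsynchronized read and write on n, even on runs where the count happens to come out correct.

```go
package main

import (
	"fmt"
	"sync"
)

// count runs `workers` goroutines that each increment n `per` times,
// with the lock held. Without the lock, n++ is a read-modify-write on
// shared memory: a race that -race will flag.
func count(workers, per int) int {
	var mu sync.Mutex
	var wg sync.WaitGroup
	n := 0
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < per; j++ {
				mu.Lock()
				n++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return n
}

func main() {
	fmt.Println(count(10, 1000))
}
```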
Yes? I believe it is not perfect; I'd have to think about it. Certainly one way it is not perfect is that if you don't execute some code, the race detector doesn't know anything about it. It's not doing static analysis; the race detector is not looking at your source and making decisions based on the source. It's watching what happened on this particular run of the program, and so if this particular run didn't execute some code that happens to read or write shared data, then the race detector will never know, and there could be a race there. So that's certainly something to watch out for: if you're serious about the race detector, you need to set up testing apparatus that tries to make sure all the code is executed. But it's very good, and you should just use it for all your 6.824 labs.
Okay, so there is a race here, and of course the race didn't actually occur. What the race detector did not see was an actual interleaved, simultaneous execution of some sensitive code: it didn't see two threads literally execute lines 43 and 44 at the same time, and as we know from having run the thing by hand, that apparently happens only with low probability. All it saw was that at one point there was a write, and then maybe much later there was a read with no intervening lock. So in that sense it can detect races that didn't actually happen, or didn't really cause bugs on that run. Okay.
Okay, one final question about this crawler: how many threads does it create, and how many concurrent threads could there be? Yeah, so a defect in this crawler is that there's no obvious bound on the number of simultaneous threads it might create. With the test case, which only has five URLs, big whoop. But if you're crawling the real web, with, I don't know, billions of URLs out there, we certainly don't want to be in a position where the crawler might accidentally create billions of threads, because thousands of threads is just fine, but billions of threads is not okay: each one sits on some amount of memory. So there are probably many defects in real life in this crawler, but one at the level we're talking about is that it may create too many threads, and it really ought to have a way of saying: you can create 20 threads, or 100 threads, or a thousand threads, but no more. One way to do that would be to pre-create a fixed-size pool of workers, and have the workers just iteratively look for another URL to crawl and crawl that URL, rather than creating a new thread for each URL. Okay, so next up I want to talk
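One way to sketch such a fixed-size pool (the "crawl" below is trivially counting jobs, just to keep the sketch self-contained): nWorkers goroutines pull URLs off a channel, so there are never more than nWorkers fetches in flight, no matter how many URLs arrive.

```go
package main

import (
	"fmt"
	"sync"
)

// pool starts a fixed number of worker goroutines that each loop,
// pulling URL after URL from the jobs channel. The pool size, not the
// number of URLs, bounds the concurrency.
func pool(urls []string, nWorkers int) int {
	jobs := make(chan string)
	var wg sync.WaitGroup
	var mu sync.Mutex
	done := 0
	for i := 0; i < nWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range jobs { // each worker iteratively takes the next job
				mu.Lock()
				done++ // stand-in for actually fetching the URL
				mu.Unlock()
			}
		}()
	}
	for _, u := range urls {
		jobs <- u
	}
	close(jobs) // no more work: lets the workers' range loops end
	wg.Wait()
	return done
}

func main() {
	fmt.Println(pool([]string{"a", "b", "c", "d", "e"}, 2))
}
```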
about another crawler, implemented in a significantly different way, using channels instead of shared memory. Remember, in the mutex crawler I just showed, there is this table of crawled URLs that's shared between all the threads and protected by a lock. This version does not have such a table, does not share memory, and does not need to use locks. Instead, there's basically a master thread, this master function at line 86, and it has a table, but the table is private to the master function. What the master function is doing, instead of creating a tree of function calls that corresponds to the exploration of the graph, which the previous crawler did, is firing off one goroutine per URL that it fetches; and it's only the master that's creating these threads. We don't have a tree of functions creating threads; we just have the one master.
So it creates its own private map at line 88 to record what it's fetched, and then it also creates a channel, just a single channel, that all of its worker threads are going to talk on. The idea is that it's going to fire up worker threads, and each worker thread it fires up, when it's finished fetching its page, will send exactly one item back to the master on the channel, and that item will be the list of URLs in the page that that worker thread fetched. The master sits in a loop at line 89, reading entries from the channel; we have to imagine that it has started up some workers in advance, and now it's reading the URL lists that those workers send back. Each time it gets a URL list at line 89, it loops over the URLs in that list, from a single page fetch, at line 90, and if a URL hasn't already been fetched, it fires off a new worker at line 94 to fetch that URL. If we look at the worker code, starting at line 77, it basically calls the fetcher and then sends a message on the channel, at line 80 or 82, saying: here are the URLs in the page I fetched.
And notice the maybe interesting thing about this: the worker threads don't share any objects. There's no shared object between the workers and the master, so we don't have to worry about locking, we don't have to worry about races. Instead, this is an example of communicating information rather than getting at it through shared memory. Yes?
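A condensed, self-contained sketch of that master/worker structure; the pages map stands in for the real fetcher, and the counting logic mirrors what the lecture describes:

```go
package main

import "fmt"

// Stand-in for the fetcher: which URLs each page links to.
var pages = map[string][]string{
	"a": {"b", "c"},
	"b": {"a"},
	"c": {"d"},
	"d": {},
}

// worker fetches one page and sends the URLs it found back on the
// channel: exactly one send per worker, and no shared memory.
func worker(url string, ch chan []string) {
	ch <- pages[url]
}

// master keeps the fetched table private to itself. It counts
// outstanding sends in n; every worker sends exactly one item, so
// when n drops to zero the crawl is finished.
func master(ch chan []string) int {
	n := 1 // one item was primed into the channel by the caller
	fetched := map[string]bool{}
	for list := range ch {
		for _, u := range list {
			if !fetched[u] {
				fetched[u] = true
				n++ // one more outstanding worker
				go worker(u, ch)
			}
		}
		n-- // the send we just received is accounted for
		if n == 0 {
			break // no outstanding workers: the crawl is done
		}
	}
	return len(fetched)
}

func runCrawl(start string) int {
	ch := make(chan []string)
	go func() { ch <- []string{start} }() // prime the channel to start things off
	return master(ch)
}

func main() {
	fmt.Println(runCrawl("a"))
}
```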
Yeah, so the observation is that the workers are modifying ch while the master is reading it. That's not the way the Go authors would like you to think about this. The way they want you to think about it is that ch is a channel, the channel has send and receive operations, the workers are sending on the channel while the master receives on the channel, and that's perfectly legal; the channel is happy. What that really means is that the internal implementation of a channel has a mutex in it, and the channel operations are careful to take out the mutex when they're messing with the channel's internal data, to ensure that the channel doesn't itself have any races in it. So channels are protected against concurrency, and you're allowed to use them concurrently from different threads. Yes?
over the channel receive yes
We don't need to close the channel. I mean, okay, the break statement is about when the crawl has completely finished and we've fetched every single URL. What's going on is that the master is keeping this n value, a variable private to the master, and every time it fires off a worker it increments n. Every worker it starts sends exactly one item on the channel, so every time the master reads an item off the channel, it knows that one of its workers has finished, and when the number of outstanding workers goes to zero, we're done. Once the number of outstanding workers goes to zero, the only reference to the channel is from the master, or really from the code that calls the master, and so the garbage collector will very soon see that the channel has no references to it and will free the channel. So sometimes you do need to close channels, but actually I rarely have to close channels.
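For contrast, here is the one common case where you do need close: a for-range over a channel only exits when the channel is closed. The crawler sidesteps this by counting outstanding workers and breaking explicitly, after which the unreferenced channel is simply garbage collected. This example (the function name and values) is my own illustration, not from the lecture code.

```go
package main

import "fmt"

// sumUntilClosed ranges over a channel until the sender closes it.
// Unlike the crawler's master, the receiver here has no count of
// pending sends, so close is the only way for it to learn that no
// more values are coming.
func sumUntilClosed() int {
	ch := make(chan int)
	go func() {
		for i := 0; i < 3; i++ {
			ch <- i
		}
		close(ch) // without this, the range below would block forever
	}()
	total := 0
	for v := range ch {
		total += v
	}
	return total
}

func main() {
	fmt.Println(sumUntilClosed()) // 0+1+2 = 3
}
```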
Say that again?
So the question is about line 106: before calling master, ConcurrentChannel shoves one URL into the channel, and that's to get the whole thing started. The way the code for master was written, the master goes right into reading from the channel at line 89, so there had better be something in the channel, otherwise line 89 would block forever. If it weren't for that little bit of code at line 107, the for loop at line 89 would block reading from the channel forever, and this code wouldn't work. Yeah, so the observation is, gosh, wouldn't it be nice to be able to write code that would notice if there's nothing waiting on the channel. And you can: if you look up the select statement, it's much more complicated than this, but there is the select statement, which allows you to proceed, to not block, if there's nothing waiting on the channel.
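A minimal sketch of that non-blocking receive: a select with a default case proceeds immediately instead of blocking when nothing is waiting on the channel. The helper name tryReceive is my own; only the select-with-default mechanism is from the lecture's pointer.

```go
package main

import "fmt"

// tryReceive attempts a receive without blocking: if no value is
// waiting on the channel, the default case runs immediately and
// reports failure instead of waiting.
func tryReceive(ch chan string) (string, bool) {
	select {
	case msg := <-ch:
		return msg, true
	default:
		return "", false // nothing waiting; don't block
	}
}

func main() {
	ch := make(chan string, 1)
	if _, ok := tryReceive(ch); !ok {
		fmt.Println("nothing waiting")
	}
	ch <- "hello"
	if msg, ok := tryReceive(ch); ok {
		fmt.Println("got", msg)
	}
}
```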
Because the workers aren't finished. Okay, sorry, on the first question: I think what you're really worried about is whether we're actually able to launch workers in parallel, since the for loop waits at line 89. But that's not quite right: the for loop at line 89 does not just loop over the current contents of the channel and then quit. The for loop at 89 may never exit; it's going to keep waiting until something shows up in the channel. So if you don't hit the break at line 99, the for loop won't exit. All right, I'm afraid we're out of time. We'll continue this; actually, we have a presentation scheduled by the TAs, who will talk more about Go.