
Lecture 2: RPC and Threads

By MIT 6.824: Distributed Systems

Summary

## Key takeaways

- **Go's Convenience: Threads, RPC, and Safety**: Go simplifies distributed programming with built-in support for threads, locking, and convenient remote procedure calls. It also offers type safety and memory safety, eliminating common bugs found in languages like C++. [00:51], [01:22]
- **Threads for Concurrency and Parallelism**: Threads (or goroutines in Go) are essential for managing concurrency in distributed systems, enabling a program to handle multiple tasks simultaneously, such as waiting for network responses or performing computations across multiple CPU cores. [03:33], [08:00]
- **Race Conditions: The Danger of Shared Memory**: When multiple threads access shared memory without proper synchronization, race conditions can occur, leading to unpredictable behavior. For example, incrementing a shared variable can result in an incorrect final value if not protected. [21:15], [22:02]
- **Mitigating Races with Locks and Synchronization**: Locks (mutexes in Go) are used to protect shared data, ensuring that only one thread can access it at a time. This prevents race conditions, making code that accesses shared state safer, though it requires careful management by the programmer. [26:13], [43:47]
- **Channels: Communicating Without Shared Memory**: Go's channels provide an alternative to shared memory for thread coordination. They allow threads to communicate by sending and receiving data, eliminating the need for locks and reducing the risk of race conditions. [33:34], [43:08]
- **WaitGroups for Coordinating Goroutines**: WaitGroups are a synchronization primitive in Go that allow a program to wait for a specific number of goroutines to complete their tasks. This is crucial for ensuring that all concurrent operations have finished before proceeding. [34:06], [51:27]

Topics Covered

  • Go's thread safety: Garbage collection and memory safety are key.
  • Threads simplify complex asynchronous operations.
  • Race conditions are insidious bugs that require deliberate tools to detect.
  • Beware of infinite goroutine creation; use worker pools for bounded concurrency.
  • Go's concurrency model avoids shared memory issues through channels.

Full Transcript

Today I'd like to talk about Go, which is interesting — especially interesting for us in this course — because Go is the language the labs use; you're all going to do the labs in it. And so I want to focus today particularly on some of the machinery that's most useful in the labs and most particular to distributed programming. First of all, it's worth asking why we use Go in this class. In fact, we could have used any one of a number of other systems-style languages; plenty of languages like Java or C# or even Python provide the kind of facilities we need, and indeed we used to use C++ in this class and it worked out fine. But Go, like many other languages, provides a bunch of features which are particularly convenient for us. There's good support for threads, and for locking and synchronization between threads, which we use a lot. It has a convenient remote procedure call package, which doesn't sound like much, but it actually turns out to be a significant constraint: in languages like C++, for example, it's actually a bit hard to find a convenient, easy-to-use remote procedure call package, and of course we use RPC all the time in this course for programs on different machines to talk to each other. Unlike C++, Go is type safe and memory safe; that is, it's pretty hard to write a program that, due to a bug, scribbles over some random piece of memory and then causes the program to do mysterious things, and that just eliminates a big class of bugs. Similarly, it's garbage collected, which means you're never in danger of freeing the same memory twice, or freeing memory that's still in use, or something — the garbage collector just frees things when they stop being used.

And one thing that's maybe not obvious until you've played around with this kind of programming before: the combination of threads and garbage collection is particularly important. One of the things that goes wrong in a non-garbage-collected language like C++, if you use threads, is that it's always a bit of a puzzle, and requires a bunch of bookkeeping, to figure out when the last thread that's using a shared object has finished using that object — because only then can you free the object. So you end up writing quite a bit of code: the programmer writes a bunch of code to manually do reference counting or something in order to figure out when the last thread has stopped using an object, and that's just a pain. That problem completely goes away if you use garbage collection, like we have in Go.

And finally, the language is simple — much simpler than C++. One of the problems with using C++ is that often, if you made an error, maybe even just a typo, the error message you get back from the compiler is so complicated that it's usually not worth trying to figure out what the error message meant; I find it's always just much quicker to go look at the line number and try to guess what the error must have been, because the language is far too complicated. Whereas Go probably doesn't have a lot of people's favorite features, but it's a relatively straightforward language. Okay — so at this point you've all done the tutorial; if you're looking for what to look at next to learn about the language, a good place to look is the document titled Effective Go, which you can find by searching the web.

All right, the first thing I want to talk about is threads. The reason why we care a lot about threads in this course is that threads are the main tool we're going to be using to manage concurrency in programs, and concurrency is of particular interest in distributed programming, because it's often the case that one program actually needs to talk to a bunch of other computers: a client may talk to many servers, or a server may be serving requests at the same time on behalf of many different clients. And so we need a way to say: my program really has seven different things going on, because it's talking to seven different clients, and I want a simple way to allow it to do these seven different things without too much complex programming — and threads are the answer. These are the things the Go documentation calls goroutines, which I'll call threads; goroutines are really the same as what everybody else calls

threads. So the way to think of threads is that you have one program and one address space — I'm going to draw a box to denote an address space — and within that address space, in a serial program without threads, you just have one thread of execution executing code in that address space: one program counter, one set of registers, and one stack, which describe the current state of the execution. In a threaded program like a Go program, you can have multiple threads — I've drawn them as multiple squiggly lines — and what each line represents, especially if the threads are executing at the same time, is a separate program counter, a separate set of registers, and a separate stack for each of the threads, so that each can have its own thread of control and be executing in a different part of the program. Hidden here is that for every thread there's a stack it's executing on. The stacks are actually all in the one address space of the program, so even though each thread has its own stack, technically they're all in the same address space, and different threads could refer to each other's stacks if they knew the right addresses — although you typically don't do that. And in Go, even the main program — when you first start up the program and it runs main — that's also just a goroutine, and it can do all the things that goroutines can do. All right, so as I

mentioned, one of the big reasons for threads is to allow different parts of the program to each be at its own point in a different activity. I usually refer to that as I/O concurrency, for historical reasons. The reason I call it I/O concurrency is that, in the old days, where this first came up, you might have one thread that's waiting to read from the disk, and while it's waiting to read from the disk, you'd like to have a second thread that can compute, or read somewhere else on the disk, or send a message on the network and wait for the reply. So I/O concurrency is one of the things that threads buy us. For us, I/O concurrency will usually mean that one program has launched remote procedure call requests to different servers on the network and is waiting for many replies at the same time — that's how it'll come up for us. And the way you would do that with threads is that you would create one thread for each of the remote procedure calls that you wanted to launch; that thread would have code that sends the remote procedure call request message and then waits at that point in the thread, and finally, when the reply came back, the thread would continue executing. Using threads allows us to have multiple threads that all launch requests into the network at the same time and all wait — or they don't have to do it at the same time; they can execute the different parts of this whenever they feel like it.
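In Go, this one-goroutine-per-call pattern can be sketched roughly as below. Here `fetch` and the server names are stand-ins I've invented for a real RPC call, not anything from the labs:

```go
package main

import (
	"fmt"
	"sync"
)

// fetch stands in for a remote procedure call to one server; a real
// version would send a request message and block until the reply arrives.
func fetch(server string) string {
	return "reply from " + server
}

// fetchAll creates one goroutine per server, so all of the "RPCs" are
// outstanding at the same time, then waits for and collects every reply.
func fetchAll(servers []string) []string {
	var wg sync.WaitGroup
	replies := make([]string, len(servers)) // each goroutine writes its own slot
	for i, s := range servers {
		wg.Add(1)
		go func(i int, s string) {
			defer wg.Done()
			replies[i] = fetch(s) // each goroutine waits for its own reply
		}(i, s)
	}
	wg.Wait() // block until every outstanding call has finished
	return replies
}

func main() {
	fmt.Println(fetchAll([]string{"s1", "s2", "s3"}))
}
```

Because each goroutine writes a distinct slice element, the workers don't contend on shared state here.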

So that's I/O concurrency: overlapping the progress of different activities, so that while one activity is waiting, other activities can proceed. Another big reason to use threads is multi-core parallelism, which I'll just call parallelism. Here the thing we'd be trying to achieve with threads is this: if you have a multi-core machine, like I'm sure all of you do in your laptops, and you have a compute-heavy job that needs a lot of CPU cycles, wouldn't it be nice if you could have one program that could use CPU cycles on all of the cores of the machine? And indeed, if you write multi-threaded Go — if you launch multiple goroutines and they do something compute-intensive, like sit there in a loop and compute digits of pi or something — then, up to the limit of the number of cores in the physical machine, your threads will run truly in parallel, and if you launch two threads instead of one, you'll be able to use twice as many CPU cycles per second. So this is very important to some people. It's not a big deal in this course, because it's rare that we'll think specifically about this kind of parallelism. In the real world, though, building things like servers that form parts of distributed systems, it can sometimes be extremely important for the server to be able to run threads and harness the CPU power of a lot of cores, just because the load from clients can often be pretty high. Okay, so parallelism is a second reason why we're quite interested in threads in distributed systems. And a third reason,

which is maybe a little bit less important, is that there are times when you really just want to be able to do something in the background, or there's just something you need to do periodically, and you don't want to have to insert checks in the main part of your program saying, should I be doing this thing that should happen every second or so? You'd just like to be able to fire something up that does whatever the periodic thing is, every second. So there are some convenience reasons too. An example which will come up for you: it's often the case that a master server may want to check periodically whether its workers are still alive, because if one of them has died, you want to launch that work on another machine — MapReduce might do that. And one way to arrange "do this check every second, or every minute — send a message to the worker: are you alive?" is to fire off a goroutine that just sits in a loop that sleeps for a second, then does the periodic thing, then sleeps for a second again. So in the labs you'll end up firing off these kinds of threads quite a bit. [Student: is the overhead worth it?] Yes — the overhead is really pretty small for this stuff. I mean, it depends on how many: if you create a million threads that sit in a loop waiting for a millisecond and then send a network message, that's probably a huge load on your machine; but if you create ten threads that sleep for a second and do a little bit of work, it's probably not a big deal at all. And I guarantee you, the programmer time you save by not having to mush the different activities together into one line of code is worth the small amount of CPU cost, almost always. Still, if you're unlucky, you'll discover in the labs that some loop of yours is not sleeping long enough, or that you fired off a bunch of these and never made them exit, for example, and they just accumulate — so you can push it too far.

Okay, so those are the main reasons that people like threads a lot and that we'll use threads in this class. Any other questions about threads in general? [Student question.] By asynchronous programming, you mean a single thread of control that keeps state about many different activities? Yeah — so this is a good question, actually. What would happen if we didn't have threads, or for some reason didn't want to use threads? How would we write a program — a server that could talk to many different clients at the same time, or a client that could talk to many servers? What tools could we use? It turns out there is another major style of structuring these programs, called asynchronous programming — I might call it event-driven programming. The general structure of an event-driven program is usually that it has a single thread and a single loop, and what that loop does is sit there and wait for any input, or any event, that might trigger processing. An event might be the arrival of a request from a client, or a timer going off — or, if you're building a window system: many of the window systems on your laptops are written in an event-driven style, where what they're waiting for is key clicks or mouse movements or something. So in an event-driven program, a single thread of control sits in a loop,

waits for input, and whenever it gets an input — like a packet — it figures out, oh, which client did this packet come from? Then it'll have a table of what the state is of whatever activity it's managing for that client, and it'll say, oh gosh, I was in the middle of reading such-and-such a file, and now it's asked me to read the next block; I'll go read the next block and return it. Threads are generally more convenient, because it's much easier to write sequential, straight-line code — code that computes, sends a message, waits for a response, whatever — in a thread than it is to chop up whatever the activity is into a bunch of little pieces that can be activated one at a time by one of these event-driven loops. That said, one problem with the event-driven scheme is that it's a little bit of a pain to program. Another potential defect is that while you get I/O concurrency from this approach, you don't get CPU parallelism, so if you're writing a busy server that would really like to keep, say, 32 cores busy on a big server machine, a single loop is not a very natural way to harness more than one core. On the other hand, the overheads of event-driven programming are generally quite a bit less than threads. Threads are pretty cheap, but each one of these threads is sitting on a stack — a stack is a kilobyte, or a few kilobytes, or something — and if you have 20 of these threads, who cares; but if you have a million of these threads, then that's starting to be a huge amount of memory, and maybe the scheduling bookkeeping for deciding which thread to run next starts to cost too — you now have scheduling lists with a thousand threads in them — so the threads can start to get quite expensive. So if you're in a position where you need a single server that serves a million clients, and it has to keep a little bit of state for each of a million clients, threads could be expensive, and — at some expense in programmer time — it's easier to write a really stripped-down, efficient, low-overhead service with event-driven programming; it's just a lot more work. [Student question.] Are you asking about JavaScript? I don't know — the question is whether JavaScript has multiple cores executing; does anybody know? It depends on the implementation — yeah, so I don't know. I mean, it's a natural thought, though: even in Go, if you knew your machine had eight cores, and you wanted to write the world's most efficient whatever-server, you could fire up eight threads and run a stripped-down event-driven loop on each of them — one event loop per core — and that would be a way to get both parallelism and the I/O concurrency. Yes?
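A single-threaded event loop of the kind described above can be sketched in Go with a channel of events and a per-client state table. The `request` type and the "next block" behavior are invented here for illustration:

```go
package main

import "fmt"

// request is one "event": a client asks for the next block of its file.
type request struct {
	client int
	reply  chan int
}

// eventLoop is a single thread of control: it waits for events in a loop
// and keeps a table of per-client state (here, each client's next block).
func eventLoop(events chan request, quit chan struct{}) {
	state := make(map[int]int) // client -> next block number
	for {
		select {
		case req := <-events:
			block := state[req.client] // look up where this client was
			state[req.client] = block + 1
			req.reply <- block
		case <-quit:
			return
		}
	}
}

func main() {
	events := make(chan request)
	quit := make(chan struct{})
	go eventLoop(events, quit)

	reply := make(chan int)
	events <- request{client: 7, reply: reply}
	fmt.Println(<-reply) // 0: client 7's first block
	events <- request{client: 7, reply: reply}
	fmt.Println(<-reply) // 1: the loop remembered client 7's state
	close(quit)
}
```

All the per-client state lives in one map owned by one goroutine, so no locks are needed — which is exactly the trade the lecture describes: I/O concurrency without CPU parallelism.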

Okay, so the question is: what's the difference between threads and processes? Usually, on a Unix machine, a process is a single program that you're running, with a single address space — a single bunch of memory for the process — and inside a process you might have multiple threads. When you write a Go program and run it, running the Go program creates one Unix process and one memory area, and then when your Go process creates goroutines, those are all sitting inside that one process. So I'm not sure that's really an answer, but historically, operating systems have provided this big box — the process — implemented by the operating system, and the operating system does not care what happens inside your process; what language you use is none of the operating system's business. But inside that process, you can run lots of threads. Now, if you run more than one process on your machine — more than one program, say an editor and a compiler — the operating system keeps them quite separate: your editor and your compiler each have memory, but it's not the same memory, and they're not allowed to look at each other's memory; there's not much interaction between different processes. So your editor may have threads, and your compiler may have threads, but they're just in different worlds. Within any one program, the threads can share memory and can synchronize with channels and use mutexes and stuff; but between processes, there's just no interaction. That's the traditional structure of this kind of software.

Yeah? [Student question.] So the question is: when a context switch happens, does it happen for all threads? Okay, so let's imagine you have a single-core machine that's really only doing one thing at a time. Maybe the right way to think about it is that you're running multiple processes on your machine; the operating system will time-slice the CPU back and forth between these programs, so when the hardware timer ticks and the operating system decides it's time to take the CPU away from the currently running process and give it to another process, that's done at the process level. It's complicated — all right, let me restart this. The threads that we use are, in the end, based on threads provided by the operating system, and when the OS does context switches, it's switching between the threads that it knows about. So in a situation like this, the operating system might know that there are two threads in this process and three threads in that process, and when the timer ticks, the operating system will — based on some scheduling algorithm — pick a different thread to run; it might be a different thread in this process, or one of the threads in the other process. In addition, Go cleverly multiplexes many goroutines on top of single operating-system threads to reduce overhead, so there are really two stages of scheduling: the operating system picks which big thread to run, and then within that process, Go may have a choice of goroutines to run.

All right. Okay, so threads are convenient because a lot of the time they allow you to write the code for each thread just as if it were a pretty ordinary sequential program. However, there are in fact some challenges with writing threaded code. One is what to do about shared data. One of the really cool things about the threading model is that these threads share the same address space — they share memory. If one thread creates an object in memory, you can let other threads use it; you can have an array or something that all the different threads are reading and writing, and that's sometimes critical. If you're keeping some interesting state — maybe your server has a cache of things in memory — then when a thread is handling a client request, it's going to first look in that shared cache; each thread reads it, and the threads may write the cache to update it when they have new information to stick in it. So it's really cool that you can share that memory. But it turns out that it's very, very easy to get bugs if you're not careful while sharing memory between threads. A totally classic example: suppose you have a global variable n that's shared among the different threads, and a thread just wants to increment it — n = n + 1. By itself, this is likely to be an invitation to bugs if you don't do anything special around this code. The reason is that whenever you write code in a thread that is reading or writing data shared with other threads, there's always the possibility — and you've got to keep it in mind — that some other thread may be looking at the data, or modifying the data, at the same time. So the obvious problem is that maybe thread 1 is executing this code, and thread 2 is actually in the same function, in a different thread, executing the very same code — and remember, n is a global variable, so they're talking about the same n. What this boils down to is that you're not actually running this source code; you're running the machine code the compiler produced, and what that machine code does is load n into a register, add one to the register, and then store that register back into n, where n lives at some location in RAM. So you could end up with both of the threads executing this line of code at the same time: they both load the variable n into a register — if n starts out at 0, that means they both load a 0 — they both increment that register, so they each get 1, and they both store 1 back to memory. Now two threads have incremented n, and the resulting value is 1 — which, well, who knows what the programmer intended; maybe that's what the programmer wanted, but chances are not. Chances are the programmer wanted 2, not 1. [Student: aren't some instructions atomic?] So the question — a very good question — is whether individual instructions are atomic, and the answer is: some are and some aren't. A 32-bit store is extremely likely to be atomic, in the sense that if two processors store 32-bit values at the same time to the same memory address, what you'll end up with is either the 32 bits from one processor or the 32 bits from the other processor — but not a mixture. For other sizes it's not so clear; for one-byte stores it depends on the CPU you're using, because a one-byte store may really be a 32-bit load, then a modification of 8 bits, then a 32-bit store — it depends on the processor. And more complicated instructions, like increment — your microprocessor may well have an increment instruction that can directly increment some memory location — are pretty unlikely to be atomic, although there are atomic versions of some of these instructions.
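The bad interleaving just described can be written out step by step. This is a deterministic replay of the load/add/store sequence in ordinary sequential code — not an actual race — so it always shows the lost update:

```go
package main

import "fmt"

// lostUpdate replays the bad interleaving from the lecture step by step:
// both "threads" load n, then both add 1, then both store. It simulates
// the machine-code steps deterministically; it is not a real race.
func lostUpdate() int {
	n := 0

	r1 := n // thread 1: load n into a register (sees 0)
	r2 := n // thread 2: load n (also sees 0)

	r1 = r1 + 1 // thread 1: add 1 -> 1
	r2 = r2 + 1 // thread 2: add 1 -> 1

	n = r1 // thread 1: store 1
	n = r2 // thread 2: store 1, overwriting — one update is lost

	return n // 1, even though n was "incremented" twice
}

func main() {
	fmt.Println(lostUpdate()) // prints 1, not 2
}
```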

All right, so this is just a classic danger, and it's usually called a race. It's going to come up a lot, since you're going to do a lot of threaded programming with shared state. "Race", I think, refers to some ancient class of bugs involving electronic circuits, but for us, the reason it's called a race is that if one of the CPUs has started executing this code, and the other thread is getting close to this code, it's sort of a race as to whether the first processor can finish and get to the store before the second processor starts to execute the load. If the first processor actually manages to do its store before the second processor gets to its load, then the second processor will see the stored value: the second processor will load 1, add 1 to it, and store 2. That's how you can justify the terminology. Okay — and the way you solve this, certainly for something this simple, is you insert locks. You, as the programmer, have some strategy in mind for locking the data; you can say, well, this piece of shared data can only be used while such-and-such a lock is held. You'll see this — you may have used it in the tutorial — Go calls locks mutexes. So what you'll see is a mu.Lock() before a sequence of code that uses shared data, and a mu.Unlock() afterwards; then, of two threads that execute this, whichever one is lucky enough to get the lock first gets to do all this stuff and finish before the other one is allowed to proceed. And so you can think of wrapping some code in a lock — remember, even though n = n + 1 is one line, it's really three distinct operations — as causing this multi-step code sequence to be atomic with respect to everyone else who acquires the lock. Yes?
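A sketch of that pattern in Go — the counter and the number of goroutines are my own choices, but the mu.Lock()/mu.Unlock() bracketing is exactly the shape described:

```go
package main

import (
	"fmt"
	"sync"
)

// increment2000 has 2000 goroutines each increment a shared counter.
// mu.Lock/mu.Unlock make the load-add-store sequence atomic with
// respect to every other goroutine that acquires the same mu.
func increment2000() int {
	var mu sync.Mutex
	var wg sync.WaitGroup
	n := 0
	for i := 0; i < 2000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			n = n + 1 // the critical section: safe while mu is held
			mu.Unlock()
		}()
	}
	wg.Wait()
	return n
}

func main() {
	fmt.Println(increment2000()) // always 2000 with the lock in place
}
```

Without the Lock/Unlock pair, the same program could lose updates exactly as in the interleaving above.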

[Student question.] Can you repeat the question? Oh, that's a great question. The question was: how does Go know which variable we're locking? Here, of course, there's only one variable, but maybe we're saying n = x + y — really three different variables. And the answer is that Go has no idea. There's no association at all, anywhere, between this lock — mu here is a variable whose type is Mutex — and any variables. There's no association in the language between the lock and any variables; the association is in the programmer's head. As a programmer, you need to say: here's a bunch of shared data — say a complex data structure, a tree, or an expandable hash table, or something — and any time you're going to modify anything associated with this data structure (and of course a tree is composed of many, many objects, and the set of objects changes, because you might allocate new tree nodes), you have to hold such-and-such a lock. It's really the programmer who works out a strategy for ensuring that the data structure is used by only one core at a time, and who creates the one — or maybe more — locks. There are many, many locking strategies you could apply to a tree; you can imagine a tree with a lock for every tree node. The programmer works out the strategy, allocates the locks, and keeps the relationship to the data in the programmer's head. But for Go, the lock is just a very simple thing: there's a lock object; the first thread that calls Lock gets the lock; other threads have to wait until someone unlocks it. And that's all Go knows.

Yeah? [Student: does it not lock all the variables that are part of the object?] Go doesn't know anything about the relationship between variables and locks. When you have code that calls Lock, exactly what it is doing is acquiring this lock, and that's all it does. Somewhere you would have declared a mutex, mu, and this mu refers to some particular lock object — and there may be many, many locks. All this call does is acquire this lock, and anybody else who wants to acquire it has to wait until we unlock it. It's totally up to us as programmers what we're protecting with that lock. [Student question.] So the question is: is it better to have the lock be the private business of the data structure? Like, suppose it's a map — and you would hope, although it's not true, that Go's map internally would have a lock protecting it. That's a reasonable strategy: if you define a data structure that needs to be locked, have the lock be interior — have each of the data structure's methods be responsible for acquiring the lock, so the user of the data structure may never know. That's pretty reasonable, and the only points at which it breaks down are a couple of things. One is that if the programmer knew the data was never shared, they might be bummed that they were paying the lock overhead for something they knew didn't need to be locked — that's one potential problem. The other is that if there are any inter-data-structure dependencies — say we have two data structures, each with its own lock, and they maybe use each other — then there's a risk of cycles, and deadlocks. The deadlocks can be solved, but the usual solutions require lifting the locks out of the implementations, up into the calling code; I'll talk about that at some point. So it's often a good idea to hide the locks, but it's not always a good

idea all right okay so one problem you

run into with threads is these races and

generally you solve them with locks okay

or actually there's two big strategies

one is you figure out some locking

strategy for making access to the data

single thread one thread at a time or

yury you fix your code to not share data

if you can do that it's that's probably

better because it's less complex all

All right, so another issue that shows up with threads is called coordination. When we're doing locking, the different threads involved probably have no idea that the other ones exist; they just want to be able to get at the data without anybody else interfering. But there are also cases where you intentionally want different threads to interact: I want to wait for you. Maybe you're a different thread than me and you're producing some data, and I'm going to wait until you've generated the data before I read it. Or you launch a bunch of threads to, say, crawl the web, and you want to wait for all those fetches to finish. So there are times when we intentionally want different threads to interact with each other, to wait for each other, and that's usually called coordination.

And as you probably know from having done the tutorial, there are a bunch of techniques in Go for doing this. There are channels, which are really about sending data from one thread to another and waiting for the data to be sent. There's also other, more special-purpose stuff, like an idea called condition variables, which is great if there's some thread out there and you want to give it a kick: you're not sure if the other thread is even waiting for you, but if it is waiting for you, you'd just like to give it a kick so it knows it should continue whatever it's doing. And then there's WaitGroup, which is particularly good for launching a known number of goroutines and then waiting for them all to finish.
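To make the channel idea concrete, here's a small sketch (not code from the lecture; the `produce`/`consume` names are made up) of one thread waiting for data that another thread generates, with the channel doing the coordination:

```go
package main

import "fmt"

// produce sends a few values on ch and then closes it;
// each send blocks until the receiver is ready for the value.
func produce(ch chan<- int) {
	for i := 0; i < 3; i++ {
		ch <- i * i
	}
	close(ch) // tells the receiver that no more values are coming
}

// consume waits for each value to arrive; the range loop ends
// when the producer closes the channel.
func consume(ch <-chan int) []int {
	var got []int
	for v := range ch {
		got = append(got, v)
	}
	return got
}

func main() {
	ch := make(chan int)
	go produce(ch)
	fmt.Println(consume(ch)) // [0 1 4]
}
```

The consumer never has to poll or lock anything: the channel's receive operation is itself the "wait until you've generated the data" step.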

A final piece of danger that comes up with threads is deadlock. Deadlock refers to the general problem you sometimes run into where one thread, say thread one, is waiting for thread two to produce something, so you'd draw an arrow to say thread one is waiting for thread two. For example, thread one may be waiting for thread two to release a lock, or to send something on a channel, or to decrement something in a WaitGroup. However, unfortunately, maybe thread two is waiting for thread one to do something. This is particularly common in the case of locks: thread one acquires lock A and thread two acquires lock B, and then next thread one needs lock B also, that is, it needs to hold two locks, which sometimes shows up, and it just so happens that thread two needs to hold lock A. That's a deadlock. Each grabs its first lock and then proceeds down to where it needs its second lock, and now they're waiting for each other forever. Neither can proceed, so neither can release its lock, and usually just nothing happens. So if your program just kind of grinds to a halt and doesn't seem to be doing anything, but didn't crash, deadlock is one thing to check.
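The standard way out of the two-lock deadlock just described is to make every thread acquire the locks in the same global order. Here's a minimal sketch (my own illustration, not the lecture's code; `withBothLocks` is a made-up helper) where both goroutines take A before B, so the A-then-B versus B-then-A cycle can never form:

```go
package main

import (
	"fmt"
	"sync"
)

var muA, muB sync.Mutex

// withBothLocks always acquires muA before muB. The deadlock in
// the lecture arises only when one thread takes A then B while
// another takes B then A; a fixed order rules that out.
func withBothLocks(f func()) {
	muA.Lock()
	defer muA.Unlock()
	muB.Lock()
	defer muB.Unlock()
	f()
}

// runBoth launches two goroutines that each need both locks.
func runBoth() int {
	count := 0
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			withBothLocks(func() { count++ })
		}()
	}
	wg.Wait()
	return count
}

func main() {
	fmt.Println(runBoth()) // 2: both goroutines finish, no deadlock
}
```

This is the "lift the locks up" idea mentioned earlier: the ordering decision lives in one place rather than being hidden inside two separate data structures.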

Okay, all right, let's look at the web crawler from the tutorial as an example of some of this threading stuff. I have two, really three, solutions in different styles, to allow us to talk a bit about the details of some of this thread programming. So first of all, you all probably know what a web crawler is. Its job is: you give it the URL of a page that it starts at, and, since many web pages have links to other pages, what a web crawler is trying to do is take that first page, extract all the URLs that were mentioned in that page's links, fetch the pages they point to, look at all those pages for the URLs that they refer to, and keep on going until it's fetched all the pages in the web, let's just say, and then it should stop.

In addition, the graph of pages and URLs is cyclic, that is, if you're not careful, if you don't remember "oh, I've already fetched this web page," you may end up following cycles forever, and your crawler will never finish. So one of the jobs of the crawler is to remember the set of pages it has already crawled, or even already started a fetch for, and to not start a second fetch for any page it has already started fetching. You can think of that as sort of imposing a tree structure, finding a tree-shaped subset of the cyclic graph of actual web pages.

Okay, so we want to avoid cycles; we want to be able to not fetch a page twice. It also turns out that it just takes a long time to fetch a web page, because servers are slow and because the network has a long speed-of-light latency, and so you definitely don't want to fetch pages one at a time, unless you want the crawl to take many years. So it pays enormously to fetch many pages at the same time, up to some limit: you want to keep increasing the number of pages you fetch in parallel until the throughput you're getting in pages per second stops increasing, that is, increase the concurrency until you run out of network capacity. So we want to be able to launch multiple fetches in parallel. And a final challenge, which is sometimes the hardest thing to solve, is to know when the crawl is finished.

Once we've crawled all the pages, we want to stop and say we're done, but we actually need to write the code to realize, aha, we've crawled every single page. For some solutions I've tried, figuring out when you're done has turned out to be the hardest part.

All right, so my first crawler is this serial crawler here, and by the way, this code is available on the website as crawler.go, on the schedule page, if you want to look at it. This first one's called Serial. It effectively performs a depth-first search of the web graph, and there's sort of one moderately interesting thing about it: it keeps this map called fetched, which it's basically using as a set in order to remember which pages it's crawled, and that's about the only interesting part of this. You give it a URL; at line 18, if it's already fetched the URL, it just returns. If it hasn't fetched the URL, it first remembers that the URL is now fetched, then actually fetches that page and extracts the URLs that are in the page with the fetcher, and then iterates over the URLs in that page and calls itself for every one of those pages.

And it passes the fetched map to itself; there's really just the one table, only one fetched map, because when I call the recursive crawl and it fetches a bunch of pages, after it returns, the outer crawl instance needs to be aware that certain pages are already fetched. So we depend very much on the fetched map being passed between the functions by reference instead of by copying. Under the hood, what must really be going on here is that Go is passing a pointer to the map object to each of the calls of crawl, so they all share a pointer to the same object in memory, rather than copying it. Any questions?
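Here's a runnable sketch along the lines of the Serial crawler just described: it's not the lecture's exact crawler.go (the `pages` map is a made-up stand-in for the fetcher), but it shows the depth-first recursion and the single fetched map shared by reference:

```go
package main

import "fmt"

// Toy "web": each URL maps to the URLs its page links to.
// This stands in for the fetcher used in the lecture's code.
var pages = map[string][]string{
	"a": {"b", "c"},
	"b": {"a", "d"},
	"c": {},
	"d": {},
}

// Serial does a depth-first crawl. fetched is the set of URLs
// already crawled; because Go maps are reference types, every
// recursive call sees the same underlying table.
func Serial(url string, fetched map[string]bool) {
	if fetched[url] {
		return // already crawled: avoid cycles and duplicate fetches
	}
	fetched[url] = true
	urls, ok := pages[url] // stands in for fetcher.Fetch(url)
	if !ok {
		return
	}
	for _, u := range urls {
		Serial(u, fetched)
	}
}

func main() {
	fetched := map[string]bool{}
	Serial("a", fetched)
	fmt.Println(len(fetched)) // 4: each page crawled exactly once
}
```

Note how the cycle a → b → a terminates only because the map remembers what's been started; delete the `fetched[url]` check and the recursion never ends.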

So this code definitely does not solve the problem that was posed, right, because it doesn't launch parallel fetches. So clearly we need to insert goroutines somewhere in this code to get parallel fetches. Let's suppose, just for chuckles, that we start with the laziest thing: I'm going to just modify the code to run the subsidiary crawls each in its own goroutine. Actually, before I do that, why don't I run the code just to show you what correct output looks like.

what correct output looks like so hoping

this other window Emad run the crawler

it actually runs all three copies of the

crawler and they all find exactly the

same set of webpages so this is the

output that we're hoping to see five

lines five different web pages are are

fetched prints a line for each one so

let me now run the subsidiary crawls in

their own go routines and run that code

So what am I going to see? The hope is to fetch these web pages in parallel, for higher performance. Okay, so you're voting for only seeing one URL, and why is that? Yes, that's exactly right: it's not going to wait in this loop at line 26, it's going to zip right through that loop. It fetches the very first web page at line 22, and then in the loop it fires off the goroutines and immediately the crawl function returns, and since it was called from main, main will exit, almost certainly before any of the goroutines was able to do any work at all. So we'll probably just see the first web page, and when I run it, you'll see here under Serial that only the one web page was found. Now in fact, since this program doesn't exit after the serial crawler, those goroutines are still running, and they actually print their output down here, interleaved with the next crawler example. But nevertheless, just adding a "go" here absolutely doesn't work, so let's get rid of that.

Okay, so now I want to show you one style of concurrent crawler. I'm presenting two of them: one written with shared data, shared objects, and locks, which is the first one, and another one written without shared data but with passing information along channels in order to coordinate the different threads. So this is the shared-data one, or really this is just one of many ways of building a web crawler using shared data.

This code is significantly more complicated than the serial crawler. It creates a thread for each fetch it does, all right, but the big difference is that it does two things: one, it does the bookkeeping required to notice when all of the crawls have finished, and two, it handles the shared table of which URLs have been crawled correctly. So this code still has this table of URLs, and that's this f.fetched map at line 43, but this table is actually shared by all of the crawler threads, and all the crawler threads are executing inside ConcurrentMutex. So we still have this sort of tree of calls to ConcurrentMutex exploring different parts of the web graph, but each one of them was launched as its own goroutine instead of as a function call. They're all sharing this table of fetched URLs, because if one goroutine fetches a URL, we don't want another goroutine to accidentally fetch the same URL. And as you can see here, at lines 42 and 45, I've surrounded them with the mutex calls that are required to prevent a race that would occur if I didn't add the mutexes.

didn't add them new Texas so the danger

here is that at line 43 a thread is

checking of URLs already been fetched so

two threads happen to be following the

same URL now two calls to concurrent

mutex end up looking at the same URL

maybe because that URL was mentioned in

two different web pages if we didn't

have the lock they'd both access the

math table to see if the threaded and

then already if the URL had been already

fetched and they both get false at line

43 they both set the URLs entering the

table to true at line 44 and at 47 they

will both see that I already was false

and then they both go on to patch the

web page so we need the lock there and

the way to think about it I think is

that we want lines 43 and 44 to be

atomic that is we don't want some other

thread to to get in and be using the

table between 43 and 44 we we want to

read the current content each thread

wants to read the current table contents

and update it without any other thread

interfering and so that's what the locks

are doing for us okay so so actually any

questions about the about the locking

strategy here
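The "read the table and update it atomically" pattern just described can be sketched on its own. This is my own illustration (the `testAndSet` name is made up, though the `fetchState` shape mirrors the lecture's code): with the lock held across both the check and the set, exactly one goroutine wins each URL, no matter how many race to fetch it:

```go
package main

import (
	"fmt"
	"sync"
)

// fetchState groups the lock with the map it protects, as the
// lecture's ConcurrentMutex crawler does.
type fetchState struct {
	mu      sync.Mutex
	fetched map[string]bool
}

// testAndSet atomically asks "was this URL already fetched?" and
// marks it fetched. Holding mu across both lines is what makes
// the check-then-set pair atomic.
func (f *fetchState) testAndSet(url string) (already bool) {
	f.mu.Lock()
	defer f.mu.Unlock()
	already = f.fetched[url]
	f.fetched[url] = true
	return
}

// raceFreeWins counts how many of n goroutines win the right to
// fetch the same URL; with the lock, it is always exactly one.
func raceFreeWins(n int) int {
	f := &fetchState{fetched: map[string]bool{}}
	var wg sync.WaitGroup
	wins := make(chan bool, n)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if !f.testAndSet("http://example/x") {
				wins <- true // this goroutine would do the fetch
			}
		}()
	}
	wg.Wait()
	close(wins)
	return len(wins)
}

func main() {
	fmt.Println(raceFreeWins(10)) // 1
}
```

Without the mutex, two goroutines could both see false at the read and both "win", which is exactly the double-fetch bug the lecture describes.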

All right, once we've checked the URL's entry in the table, at line 51 it just fetches that page in the usual way, and then the other interesting thing that's going on is the launching of the threads. Yes?

So the question is, what's with the f.mu? Okay, so there's a structure defined at line 36 that just collects together all the different state we need to run this crawl. Here it's only two objects, but it could be a lot more, and they're only grouped together for convenience. There's no deep significance to the fact that mu and fetched are stored inside the same structure, and f.mu is just the syntax for getting at one of the elements of the structure. I just happened to put the mutex in the structure because it allows me to group together all the stuff related to a crawl, but that absolutely does not mean that Go associates the mutex with that structure, or with the fetched map, or anything. It's just a lock object that has a Lock function you can call, and that's all that's going on.

So the question is, how come in order to pass something by reference I had to use a star here, whereas in the previous example, when we were passing a map, we didn't have to use a star, that is, didn't have to pass a pointer? That star notation you're seeing there on line 41 is basically saying that we're passing a pointer to this fetchState object, and we want it to be a pointer because we want there to be one object in memory, and all the different goroutines want to use that same object, so they all need a pointer to that same object. For a structure you define yourself, that's the syntax you use for passing a pointer. The reason we didn't have to do it with the map is because, although it's not clear from the syntax, a map is a pointer: it's just that, because maps are built into the language, they don't make you put a star there. If you declare a variable of type map, what that is is a pointer to some data in the heap. So it was a pointer anyway, and it's always passed by reference; you just don't have to put the star, the language does it for you. So maps are definitely special: you cannot define map within the language, it has to be built in, because there are some curious things about it. Okay, good.
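The map-versus-struct distinction just discussed is easy to demonstrate. A minimal sketch (my own example; the names are made up): a map mutates through a plain parameter because the map value is itself a reference, while a struct needs an explicit pointer:

```go
package main

import "fmt"

type fetchState struct{ n int }

// A map value is already a reference: the callee mutates the
// caller's map with no * in sight.
func bumpMap(m map[string]int) { m["k"]++ }

// A struct is copied on the way in unless you pass a pointer.
func bumpStruct(s fetchState) { s.n++ } // modifies a copy only
func bumpPtr(s *fetchState)   { s.n++ } // modifies the caller's struct

func main() {
	m := map[string]int{}
	bumpMap(m) // caller's map changed

	s := fetchState{}
	bumpStruct(s) // s.n still 0 afterwards
	bumpPtr(&s)   // s.n becomes 1

	fmt.Println(m["k"], s.n) // 1 1
}
```

This is why the crawler can pass `fetched map[string]bool` plainly but must pass `*fetchState` with a star.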

Okay, so we fetch the page; now we want to fire off a crawl goroutine for each URL mentioned in the page we just fetched. That's done starting at line 56, which loops over the URLs that the fetch function returned, and for each one fires off a goroutine at line 58. That func syntax on line 58 is a closure, a sort of immediate function: what the func keyword is doing is declaring a function right there, which we then call. The way to read it, maybe, is that you can declare a function as a piece of data, just func, and then you give the arguments, and then you give the body, and that's a closure, and it's an object now. It's like when you type 1 or 23 or something: you're declaring a sort of constant object, and this is the way to define a constant function. We do it here because we want to launch a goroutine that's going to run this function we declared right here, and so, in order to make the goroutine, we have to add a "go" in front to say we want a goroutine, and then we have to call the function, because the syntax of the go keyword is that you follow it by a function name and the arguments you want to pass to that function, and so we pass some arguments here.

arguments here and there's two reasons

we're doing this well really this one

reason we you know in some other

circumstance we could have just said go

concurrent mutex oh I concur mutex is

the name of the function we actually

want to call with this URL but we want

to do a few other things as well so we

define this little helper function that

first calls concurrent mutex for us with

the URL and then after them current

mutex is finished we do something

special in order to help us wait for all

the crawls to be done before the outer

function returns so that brings us to

So that brings us to the WaitGroup. The WaitGroup at line 55 is just a data structure defined by Go to help with coordination. The deal with a WaitGroup is that internally it has a counter: you call Add, like at line 57, to increment the counter, and Done to decrement it, and then this Wait method, called at line 63, waits for the counter to get down to zero. So a WaitGroup is a way to wait for a specific number of things to finish, and it's useful in a bunch of different situations. Here we're using it to wait for the last goroutine to finish, because we add one to the WaitGroup for every goroutine we create; at line 60, at the end of this function we've declared, we decrement the counter in the WaitGroup, and then line 63 waits until all the decrements have happened. And so the reason we declared this little function was basically to be able to both call ConcurrentMutex and call Done; that's really why we needed that function.

So the question is, what if one of the goroutines fails and doesn't reach the Done line? That's a darn good question. I forget the exact range of errors that will cause a goroutine to fail without causing the whole program to fail, maybe dividing by zero, or dereferencing a nil pointer, I'm not sure, but there are certainly ways for a function to fail and have the goroutine die without having the program die, and that would be a problem for us. So really the right way to write this, and I'm sure you had this in mind in asking the question, to be sure that the Done call is made no matter why this goroutine is finishing, would be to put a defer here, which means: call Done when the surrounding function finishes, and always call it, no matter why the surrounding function finished. Yes?
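The Add/Done/Wait pattern with the defer just recommended looks like this in miniature (my own sketch, not the crawler itself; `squares` is a made-up example):

```go
package main

import (
	"fmt"
	"sync"
)

// squares computes n squares, one goroutine each, and waits for
// all of them with a WaitGroup before returning.
func squares(n int) []int {
	var wg sync.WaitGroup
	out := make([]int, n)
	for i := 0; i < n; i++ {
		wg.Add(1) // one Add per goroutine, done before launching it
		go func(k int) {
			defer wg.Done() // runs even if the body returns early or panics
			out[k] = k * k  // each goroutine writes only its own slot: no race
		}(i)
	}
	wg.Wait() // blocks until the internal counter is back to zero
	return out
}

func main() {
	fmt.Println(squares(5)) // [0 1 4 9 16]
}
```

Calling Add before the `go` statement (not inside the goroutine) matters: otherwise Wait could observe a zero counter before any goroutine has had a chance to run.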

So the question is, how come two uses of Done in different threads aren't a race? Yeah, so the answer must be that internally a WaitGroup has a mutex, or something like it, that each of its methods acquires before doing anything else, so that simultaneous calls to Done, or to any of a WaitGroup's methods, aren't races.

Yeah, certainly: for C++, and in C, you want to look at something called pthreads. For C, threads come in a library, they're not really part of the language, called pthreads, and these are extremely traditional and ancient primitives that all languages have.

Say it again? You know, not in this code, but you could imagine other uses of WaitGroups; I mean, WaitGroups just count stuff, and a WaitGroup doesn't really care what you're counting or why. This is just the most common way to see it used. You're wondering why u is passed as a parameter to the function at line 58? Okay.

All right, so, actually backing up a little bit: the rule for a function like the one I'm defining on line 58 is that if the function body mentions a variable that's declared in the outer function but not shadowed, then the inner function's use of it is the same variable as in the outer function. That's what's happening with fetcher, for example. What does the fetcher variable refer to in the inner function? Well, it's the same variable as the fetcher in the outer function; it just is that variable, so when the inner function refers to fetcher, it's just referring to the same variable as this one here. And the same with f: where f is used here, it just is this variable. So you might think that we could get rid of this u argument we're passing, have the inner function take no arguments at all, and just use the u that was defined up on line 56 in the loop, and it would be nice if we could do that, because it would save us some typing. It turns out not to work, and the reason is the semantics of Go's for loop at line 56.

The for loop updates the variable u. So in the first iteration of the for loop, the variable u contains some URL, and when you enter the second iteration, that variable's contents are changed to be the second URL. That means that if the first goroutine we launched were just looking at the outer function's u variable, that first goroutine would see a different value in u after the outer function updated it. And sometimes that's actually what you want: for example, for f, and in particular f.fetched, the inner function absolutely wants to see changes to that map. But for u we don't want to see changes: the first goroutine we spawn should read the first URL, not the second URL. So we want that goroutine to have its own private copy of the URL, and, you know, we could have done it in other ways, but the way this code happens to produce the copy private to that inner function is by passing the URL as an argument. Yes?

Yeah, if we had passed the address of u, yeah, then, I don't know exactly how strings work internally, but passing u as an argument is absolutely giving you your own private copy of the variable. Are you saying we don't need to play this trick in the code? We definitely need to play this trick in the code, and what's going on is this. So the question is: strings are immutable, right, so if strings are immutable, how can the outer function change the string? There should be no problem. The problem is not that the string is changed; the problem is that the variable u is changed. When the inner function mentions a variable that's defined in the outer function, it's referring to that variable and the variable's current value. So if you have a string variable that has "a" in it, and then you assign "b" to that string variable, you're not overwriting the string, you're changing the variable to point to a different string. And because the for loop changes the u variable to point to a different string, that change to u would be visible inside the inner function, and therefore the inner function needs its own copy of the variable.

Essentially it makes a copy of it, and that is what we're doing in this code, and that is why this code works.
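The loop-variable trick just explained, sketched on its own (my example, with a made-up `collect` function; note that since Go 1.22 the loop variable is freshly bound per iteration, but in the Go of this lecture, passing it as an argument is how you get the private copy):

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// collect launches one goroutine per URL. Passing u as an
// argument gives each goroutine its own private copy; in older
// Go, capturing the loop variable directly could make every
// goroutine read whatever value u held when it happened to run.
func collect(urls []string) []string {
	var (
		mu   sync.Mutex
		wg   sync.WaitGroup
		seen []string
	)
	for _, u := range urls {
		wg.Add(1)
		go func(u string) { // u here is this goroutine's own copy
			defer wg.Done()
			mu.Lock()
			seen = append(seen, u)
			mu.Unlock()
		}(u)
	}
	wg.Wait()
	sort.Strings(seen) // goroutines finish in no particular order
	return seen
}

func main() {
	fmt.Println(collect([]string{"c", "a", "b"})) // [a b c]
}
```

With the buggy version (no argument, direct capture, pre-1.22), a typical run could print the same URL several times, which is exactly the "first goroutine sees the second URL" problem from the lecture.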

As for the proposal, the broken code that we're not using here: I'll show you the broken code. This is just a horrible detail, but it is unfortunately one that you'll run into while doing the labs, so you should at least be aware that there's a problem, and when you run into it, maybe you can figure out the details. Okay.

That's a great question. So the question is, just to repeat it: if you have an inner function that refers to a variable in the surrounding function, but the surrounding function returns, what is the inner function's variable referring to anymore, since the outer function has returned? And the answer is that Go notices. Go analyzes your inner functions, these are called closures; the compiler analyzes them and says, aha, this closure, this inner function, is using a variable in the outer function, and the compiler will allocate heap memory to hold the current value of the variable, and both functions will refer to that little area of heap that holds the variable. So the variable won't be allocated on the stack as you might expect; it's moved to the heap if the compiler sees that it's used by a closure. Then, when the outer function returns, the object is still there in the heap, the inner function can still get at it, and the garbage collector is responsible for noticing when the last function that refers to that little piece of heap has exited and returned, and freeing it only then. Okay.
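A tiny example of the heap-escape behavior just described (my own sketch; `counter` is a made-up name): the closure returned here keeps n alive long after `counter` itself has returned:

```go
package main

import "fmt"

// counter returns a closure over n. Even though counter has
// returned, n lives on: the compiler's escape analysis moves it
// to the heap, and the garbage collector frees it only once
// nothing refers to the closure anymore.
func counter() func() int {
	n := 0
	return func() int {
		n++
		return n
	}
}

func main() {
	c := counter() // counter has returned; n is on the heap
	c()
	c()
	fmt.Println(c()) // 3
}
```

Each call to `counter()` gets its own heap-allocated n, so two counters never interfere with each other.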

Okay, so the WaitGroup is maybe the more important thing here: the technique this code uses to wait for all of its direct children's crawls to finish is the WaitGroup, and of course there are many of these WaitGroups, one per call to ConcurrentMutex; each call to ConcurrentMutex just waits for its own children to finish and then returns.

Okay, so back to the lock. Actually, there's one more thing I want to talk about with the lock, and that is to explore what would happen if we hadn't locked. I'm claiming, oh, you know, if you don't lock, you're going to get these races, you're going to get incorrect execution, whatever. Let's give it a shot: I'm going to comment out the locks, and the question is, what happens if I run the code with no locks, what am I going to see? So we may see a URL crawled twice, or fetched twice, yeah, that's the error you might expect. All right, so I'll run it without locks, and we're looking at the concurrent-mutex output, the one in the middle. This time it doesn't seem to have fetched anything twice; it's only five. Run it again: gosh, so far so good, so maybe we're wasting our time with those locks. It never seems to go wrong; I've actually never seen it go wrong.

So the code is nevertheless wrong, and someday it will fail. The problem is that this is only a couple of instructions here, and so the chance of these two threads, which are each maybe hundreds of instructions, happening to stumble on the same couple of instructions at the same time is quite low. And indeed, this is a real bummer about buggy code with races: it usually works just fine, but it probably won't work when the customer runs it on their computer.

So it's actually bad news for us, right? It can be, in complex programs, quite difficult to figure out if you have a race; you may have code that looks completely reasonable but is in fact, unknown to you, using shared variables. And the answer is that really the only way to find races in practice is to use automated tools, and luckily Go actually gives us this pretty good race detector, built in, and you should use it. If you pass the -race flag when you execute your Go program, it will run this race detector. Well, I'll run the race detector and we'll see.

So it emits an error message: it's found a race, and it actually tells us exactly where the race happened. There's a lot of junk in this output, but the really critical thing is that the race detector realized that we had read a variable, that's what this "read" is, that was previously written, and there was no intervening release and acquire of a lock; that's what this means. Furthermore, it tells us the line numbers: it's told us that the read was at line 43 and the previous write was at line 44, and indeed, if we look at the code, the read is at line 43 and the write is at line 44. So that means that one thread did a write at line 44, and then, without any intervening lock, another thread came along and read that written data at line 43. That's basically what the race detector is looking for.

The way it works internally is that it allocates sort of shadow memory, and unluckily it uses a huge amount of memory: basically, for every one of your memory locations, the race detector allocates a little bit of memory of its own, in which it keeps track of which threads recently read or wrote every single memory location. It also keeps track of when threads acquire and release locks and do other synchronization activities that it knows force threads to not run concurrently, and if the race detector ever sees, aha, there was a memory location that was written and then read with no intervening lock, it'll raise an error.

Yes, I believe it is not perfect. I'd have to think about it, but certainly one way it is not perfect is that if you don't execute some code, the race detector doesn't know anything about it. It's not doing static analysis: the race detector is not looking at your source and making decisions based on the source; it's watching what happened on this particular run of the program. So if this particular run of the program didn't execute some code that happens to read or write shared data, then the race detector will never know, and there could be a race there. That's certainly something to watch out for, so if you're serious about the race detector, you need to set up sort of a testing apparatus that tries to make sure all the code is executed. But it's very good, and you should just always use it for your 6.824 labs. Okay, so this is a race here, and of course the race didn't actually occur: what the race detector did not see was the actual interleaved, simultaneous execution of some sensitive code. It didn't see two threads literally execute lines 43 and 44 at the same time, and as we know from having run the thing by hand, that apparently happens only with low probability. All it saw was that at one point there was a write, and maybe much later there was a read with no intervening lock, and so, in that sense, it can detect races that didn't actually happen, or didn't really cause bugs. Okay.
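For reference, the detector is enabled with a flag on the normal Go commands, e.g. `go run -race crawler.go` or `go test -race`. Here's a small sketch of the kind of program it checks (my own example, not the crawler): as written, the counter is lock-protected and race-free; if you delete the Lock/Unlock pair, `-race` reports a data race on count, with the read and write locations, even on runs where the final total happens to come out right:

```go
package main

import (
	"fmt"
	"sync"
)

// lockedCount has n goroutines increment a shared counter.
// The mutex makes the read-increment-write atomic; without it,
// `go run -race` flags the unsynchronized access to count.
func lockedCount(n int) int {
	var mu sync.Mutex
	var wg sync.WaitGroup
	count := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			count++ // without the lock, this line races with itself
			mu.Unlock()
		}()
	}
	wg.Wait()
	return count
}

func main() {
	fmt.Println(lockedCount(100)) // 100
}
```

Note that, as the lecture says, the unlocked version may still print 100 on most runs; the race detector catches the bug anyway, because it tracks the missing synchronization rather than waiting for a wrong answer.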

Okay, one final question about this crawler: how many threads does it create, and how many concurrent threads could there be? Yeah, so a defect in this crawler is that there's no obvious bound on the number of simultaneous threads it might create. With the test case, which only has five URLs, big whoop, but if you're crawling the real web, with, I don't know, billions of URLs out there, maybe not, we certainly don't want to be in a position where the crawler might accidentally create billions of threads. Because, you know, thousands of threads is just fine, but billions of threads is not okay, because each one sits on some amount of memory. So there are probably many defects in real life in this crawler, but one at the level we're talking about is that it can create too many threads, and it really ought to have a way of saying, well, you can create 20 threads, or 100 threads, or 1,000 threads, but no more. One way to do that would be to pre-create a fixed-size pool of workers, and have the workers just iteratively look for another URL to crawl, and crawl that URL, rather than creating a new thread for each URL.
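The fixed-size worker pool just suggested might look like this sketch (my own illustration; `crawlPool` is a made-up name, and the "fetch" is a stand-in): no matter how many URLs arrive, at most nWorkers goroutines ever exist:

```go
package main

import (
	"fmt"
	"sync"
)

// crawlPool processes urls with a fixed pool of nWorkers
// goroutines, bounding concurrency instead of spawning one
// goroutine per URL.
func crawlPool(urls []string, nWorkers int) int {
	work := make(chan string)
	var wg sync.WaitGroup
	var mu sync.Mutex
	fetched := 0
	for i := 0; i < nWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range work { // each worker loops, pulling URLs off the channel
				mu.Lock()
				fetched++ // stands in for actually fetching the page
				mu.Unlock()
			}
		}()
	}
	for _, u := range urls {
		work <- u // blocks until some worker is free
	}
	close(work) // lets the workers' range loops end
	wg.Wait()
	return fetched
}

func main() {
	fmt.Println(crawlPool([]string{"a", "b", "c", "d", "e"}, 2)) // 5
}
```

The channel here doubles as the queue of pending work and as the throttle: sending blocks when all workers are busy, which is exactly the bound on concurrency the lecture asks for.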

Okay, so next up, I want to talk about another crawler, implemented in a significantly different way: using channels instead of shared memory. Remember, in the mutex crawler I just showed, there is this table of URLs already crawled that's shared between all the threads and has to be locked; this version does not have such a table, does not share memory, and does not need to use locks. Instead, there's basically a master thread, that's this master function, around line 86, and it has a table, but the table is private to the master function. What the master function is doing, instead of sort of creating a tree of function calls that corresponds to the exploration of the graph, which the previous crawler did, is firing off one goroutine per URL that it fetches, but it's only the master, the one master, that's creating these threads. So we don't have a tree of functions creating threads; we just have the one master.

Okay, so it creates its own private map at line 88 to record what it's fetched, and then it also creates a channel, just a single channel, that all of its worker threads are going to talk to. The idea is that it's going to fire up a worker thread, and each worker thread that it fires up, when it's finished fetching its page, will send exactly one item back to the master on the channel, and that item will be a list of the URLs in the page that that worker thread fetched. So the master sits in a loop: at line 89 it's reading entries from the channel, and so we have to imagine that it started up some workers in advance, and now it's reading the information, the URL lists, that those workers send back. Each time it gets a URL list at line 89, it then loops over the URLs in that list from a single page fetch, at line 90, and if a URL hasn't already been fetched, it fires off a new worker at line 94 to fetch that URL. And if we look at the worker code, starting at line 77, it basically calls the fetcher and then sends a message on the channel, at line 80 or 82, saying: here are the URLs in the page it fetched. And notice, and maybe this is the interesting thing about this, that the worker threads don't share any objects: there's no shared object between the workers and the master, so we don't have to worry about locking, we don't have to worry about races. Instead, this is an example of communicating information rather than getting at it through shared memory. Yes?

Yeah, so the observation is that the workers are modifying ch while the master is reading it, and that's not the way the Go authors would like you to think about this. The way they want you to think about it is that ch is a channel, and the channel has send and receive operations; the workers are sending on the channel while the master receives on the channel, and that's perfectly legal -- the channel is happy. What that really means is that the internal implementation of a channel has a mutex in it, and the channel operations are careful to take out that mutex when they're messing with the channel's internal data, to ensure that it doesn't actually have any races in it. So yes, channels are protected against concurrency, and you're allowed to use them concurrently from different threads. Yes?

[Question about the channel receive.] Yes -- we don't need to close the channel. Okay, the break statement is about when the crawl has completely finished and we've fetched every single URL. What's going on is the master is keeping this n value -- it's a private value in the master -- and every time it fires off a worker it increments n. Every worker it starts sends exactly one item on the channel, and so every time the master reads an item off the channel, it knows that one of its workers has finished, and when the number of outstanding workers goes to zero, then we're done. And once the number of outstanding workers goes to zero, the only reference to the channel is from the master -- or really from the code that calls the master -- and so the garbage collector will very soon see that the channel has no references to it and will free the channel. So in this case -- sometimes you need to close channels, but actually I rarely have to close channels.

Say that again?

So the question is about line 106: you can see that before calling master, ConcurrentChannel fires up a goroutine that shoves one URL into the channel, and that's to get the whole thing started. Because of the way the code for master was written, the master goes right into reading from the channel at line 89, so there had better be something in the channel, otherwise line 89 would block forever. So if it weren't for that little bit of code at line 107, the for loop at 89 would block reading from the channel forever and this code wouldn't work.

Well, yeah, so the observation is: gosh, wouldn't it be nice to be able to write code that would notice if there's nothing waiting on the channel? And you can: if you look up the select statement -- it's much more complicated than this, but there is the select statement, which allows you to proceed, to not block, if there's nothing waiting on the channel.

because the workers haven't finished. Okay, sorry -- to the first question: I think what you're really worried about is whether we're actually able to launch workers in parallel. So the very first step won't be in parallel, because there's exactly one URL in the channel that the for loop waits on at line 89. But no -- that for loop at line 89 does not just loop over the current contents of the channel and then quit. That is, the for loop at 89 may never exit; it's just going to keep waiting until something shows up in the channel. So if you don't hit the break at line 99, the for loop won't exit.

All right, I'm afraid we're out of time. We'll continue this -- actually we have a presentation scheduled by the TAs which will talk more about Go.
