
Why the AI chip era is bound to arrive: the tens-fold performance journey that began with the TPU

Time: 2018-9-26 9:53:03

  

 

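The end of Moore's Law will make domain-specific architectures the future of computing. A pioneering example is the Tensor Processing Unit (TPU), which Google introduced in 2015 and which now serves more than a billion people. The TPU runs deep neural networks (DNNs) 15 to 30 times faster than contemporary CPUs and GPUs built with similar technology, with 30 to 80 times better energy efficiency.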

 

Key insights

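  • Although the TPU is an application-specific integrated circuit, it runs neural networks programmed in TensorFlow, and so drives many of the important applications in Google datacenters, including image recognition, translation, search, and game playing.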

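  • By dedicating chip resources specifically to neural networks, the TPU is 30 to 80 times more efficient than general-purpose computers on real datacenter workloads, and it currently provides everyday services to one billion people worldwide.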

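  • The inference phase of a neural network usually has strict response-time requirements. This reduces the effectiveness of general-purpose computing techniques, which usually run faster but in some cases perform worse.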

 

All exponential laws come to an end

 

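In 1965, Gordon Moore, one of Intel's founders, predicted that the number of transistors on a chip would double every one to two years. Although an article in the January 2017 Communications of the ACM argued otherwise, Moore's Law has indeed slowed. The DRAM chip introduced in 2014 contained 8 billion transistors, and a 16-billion-transistor DRAM chip is not expected to reach mass production even by 2019, whereas Moore's Law would predict roughly a fourfold increase over that period. The 2010 Intel Xeon E5 had 2.3 billion transistors and the 2016 Xeon E5 had only 7.2 billion, which is 2.5 times below what Moore's Law predicts. Semiconductor technology is clearly still advancing, but at a much slower pace.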

 

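Dennard scaling is a lesser-known but equally important observation. In 1974, Robert Dennard observed that as transistors shrink, the power density of a chip stays constant. If the linear dimensions of a transistor shrink by a factor of two, four times as many transistors fit in the same chip area. If current and voltage are each halved as well, the power each transistor uses drops by a factor of four, so the chip delivers the same power at the same frequency. Dennard scaling ended about 30 years after it was observed, not because transistors stopped shrinking, but because current and voltage could not keep dropping while remaining dependable.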

 

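Architects long relied on Moore's Law and Dennard scaling, turning resources into performance through sophisticated processor designs and memory hierarchies. These designs exploited parallelism among instructions in ways that were largely invisible to programmers. Alas, architects eventually ran out of new ways to exploit instruction-level parallelism efficiently. Dennard scaling ended in 2004, and the simultaneous lack of further instruction-level parallelism forced the industry to switch from a single power-hungry processor per chip to multiple efficient cores per chip.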

 

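Today we still follow the law that Gene Amdahl, the noted IBM engineer, put forward in 1967, which says that adding more processors yields diminishing returns. Amdahl's Law states that the theoretical speedup from parallelism is limited by the sequential part of the task. If 1/8 of the task is serial, the maximum speedup is only 8 times the original performance, even if the rest of the task is easily parallel and the architect adds 100 processors.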
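The 1/8 example can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name `amdahl_speedup` is made up for this example, and it assumes the parallel part of the task scales perfectly with the number of processors.

```python
def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Theoretical speedup when only the parallel part benefits from more processors."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# The article's example: 1/8 of the task is serial and 100 processors are added.
speedup_100 = amdahl_speedup(1 / 8, 100)

# Upper bound with an unlimited number of processors: 1 / serial_fraction.
upper_bound = 1.0 / (1 / 8)

print(f"100 processors: {speedup_100:.2f}x (upper bound {upper_bound:.0f}x)")
```

With 100 processors the speedup stays below the 8x ceiling, which is the point of the paragraph above.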

 

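The figure below shows the effect of these three laws on processor performance over the past four decades. At the current rate, performance on standard processor benchmarks will not double again before 2038.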

 

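Figure 1. Following Hennessy and Patterson, we plotted the highest SPECCPUint performance per year for 32-bit and 64-bit processor cores over the past 40 years; the throughput-oriented SPECCPUint_rate tells a similar story, even if its plateau is delayed by a few years.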

 

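Transistors are no longer getting much better (reflecting the end of Moore's Law), power per square millimeter of chip area is increasing (Dennard scaling has ended too), the power budget per chip is not increasing (because of electromigration, mechanical, and thermal limits), and chip designers have already played the multicore card (which is itself limited by Amdahl's Law). Architects now widely believe that the only path left to significantly improve the balance of performance, cost, and energy is domain-specific architectures, which handle only a few specific tasks but handle them extremely efficiently.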

 

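The synergy between large datasets in the cloud and the many computers that power it has enabled great advances in machine learning, especially in deep neural networks (DNNs). Unlike some other domains, DNNs are broadly applicable. DNN breakthroughs include cutting word error rates in speech recognition by 30%, the biggest gain in nearly 20 years; reducing the image-recognition error rate from 26% to 3.5% since 2011; beating a human world champion at Go; and improving search ranking. A single DNN architecture may be narrow, yet the approach still supports a large number of practical applications.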

 

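Neural networks mimic brain-like functionality and are based on a simple artificial neuron: a nonlinear function, such as max(0, value), of a weighted sum of the inputs. These artificial neurons are organized into layers, with the output of one layer becoming the input of the next. The "deep" in deep neural network refers to having many layers. Because the datasets in the cloud are so large, more accurate models can be built simply by using more and larger layers, which capture higher-level patterns or concepts, and GPUs provide enough computing power to develop such DNNs.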
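The neuron described above can be written out directly. The following is a minimal sketch in plain Python, not taken from the article: the names `relu` and `dense_layer`, the bias terms, and all weights and inputs are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def relu(value: float) -> float:
    """The nonlinear function max(0, value) mentioned in the text."""
    return max(0.0, value)


def dense_layer(inputs, weights, biases):
    """Each neuron outputs relu(weighted sum of all inputs + bias)."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        weighted_sum = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(relu(weighted_sum))
    return outputs


# Two stacked layers: the output of the first is the input of the second.
x = [1.0, -2.0]
hidden = dense_layer(x, weights=[[0.5, -0.25], [-1.0, 0.5]], biases=[0.0, 0.0])
output = dense_layer(hidden, weights=[[2.0, 1.0]], biases=[0.5])

print(hidden, output)
```

The second hidden neuron has a negative weighted sum, so the nonlinearity clamps it to zero before it reaches the next layer.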

 

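The two phases of running a DNN are called training (or learning) and inference (or prediction), and they correspond to development and production. Training a DNN can take days, but a trained DNN can produce an inference in milliseconds. For each application, the developer chooses the type of neural network and the number of layers, and training determines the weights in the network. Virtually all training is done in floating point, which is one reason GPUs have been so popular in the deep learning era.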

 

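A transformation called "quantization" turns floating-point numbers into integers, often only 8 bits wide, which is usually good enough for inference. Compared with an IEEE 754 16-bit floating-point multiply, an 8-bit integer multiply needs only 1/6 of the energy and 1/6 of the area, and the advantage for integer addition is 13 times in energy and 38 times in area.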
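To make the idea of quantization concrete, here is a minimal sketch of one common scheme, symmetric linear quantization to signed 8-bit integers. It is an assumption chosen for illustration; the article does not say which quantization scheme Google's tools use, and the function names and sample weights are hypothetical.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using a single shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the 8-bit integers."""
    return [q * scale for q in quantized]


weights = [0.8, -1.0, 0.3, 0.0]          # hypothetical trained weights
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)

# The rounding error per weight is at most half of one quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q_weights, max_error <= scale / 2 + 1e-12)
```

The integers occupy one byte each, and the reconstruction error is bounded by half a quantization step, which is why such narrow values are often adequate for inference.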

 

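The table below shows two examples of each of three types of DNN, six in all, that represent 95% of the DNN inference workload in Google datacenters in 2016 and that serve as our benchmarks. Their TensorFlow code is very short, only 100 to 1,500 lines. These examples are small components of larger applications running on the host server, which can consist of thousands to millions of lines of C++ code. The applications are usually user-facing, which leads to strict response-time requirements.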

 

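Table 1. DNN workloads on Google TPUs as of July 2016: six DNN applications (two per DNN type) that represent 95% of the TPU workload.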

 

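As Table 1 shows, each model needs between 5 million and 100 million weights, and simply accessing these weights takes a great deal of time and energy. To amortize the access cost, the same weights are reused across a batch of independent input examples during training or inference, which improves performance.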

 

TPU origin, architecture, and implementation

 

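As early as 2006, Google began to consider deploying GPUs (graphics processing units), FPGAs (field programmable gate arrays), or application-specific integrated circuits (ASICs) in its datacenters. The conclusion at the time was that few applications could run on special hardware, and those that could were able to use the spare capacity of Google's large datacenters essentially for free, and it is hard to improve much on free. The situation changed in 2013, when it was projected that if users spent three minutes a day on voice search using speech-recognition DNNs, Google's datacenter computing demand would double, which would be very expensive to meet with conventional CPUs. Google therefore started a very high-priority project to quickly produce a custom chip for inference and bought off-the-shelf GPUs for training. The goal was to improve cost-performance by 10 times. To that end, Google designed, verified, built, and deployed the TPU in its datacenters in only 15 months.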

 

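To reduce the risk of delaying deployment, Google engineers designed the TPU as a coprocessor on the I/O bus that plugs into existing servers just as a GPU does, rather than integrating the TPU tightly with the CPU. In addition, to simplify hardware design and debugging, the host server sends instructions to the TPU for it to execute, rather than the TPU fetching them itself. The TPU is thus closer in spirit to a floating-point unit (FPU) coprocessor than to a GPU.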

 

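Google engineers optimized the TPU as a system. To reduce interactions with the host CPU, the TPU runs whole inference models, yet it is flexible enough to match the DNNs of 2015 and beyond rather than only the DNNs of 2013.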

 

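TPU instructions are sent from the host over the PCI-e Gen3 x16 bus into an instruction buffer. The internal blocks are typically connected by 256-byte-wide paths. In the chip floor plan on the right, starting from the upper right corner, the matrix multiply unit is the heart of the TPU: its 256×256 MACs perform 8-bit multiply-and-add operations on signed or unsigned integers. The 16-bit products are collected in the 4 MiB of 32-bit accumulators below the matrix unit. The 4 MiB represents 4,096 256-element 32-bit accumulators. The matrix unit produces one 256-element partial sum per clock cycle.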
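The sizes quoted above are easy to cross-check. The short calculation below uses only numbers that appear in the article (256×256 MACs, 4,096 accumulators of 256 32-bit elements, and the 700 MHz clock mentioned later in the text). Counting each multiply-accumulate as two operations is an assumption made here for the peak-rate estimate.

```python
# Matrix multiply unit: a 256 x 256 grid of multiply-accumulate (MAC) cells.
ROWS = COLS = 256
mac_units = ROWS * COLS

# Accumulator storage: 4,096 accumulators, each holding 256 elements of 32 bits.
ACCUMULATOR_COUNT = 4096
ELEMENTS_PER_ACCUMULATOR = 256
BYTES_PER_ELEMENT = 4  # 32 bits
accumulator_bytes = ACCUMULATOR_COUNT * ELEMENTS_PER_ACCUMULATOR * BYTES_PER_ELEMENT
accumulator_mib = accumulator_bytes / 2**20

# Peak rate at the 700 MHz clock quoted later in the article.
CLOCK_HZ = 700_000_000
OPS_PER_MAC = 2  # assumption: one multiply plus one add per MAC per cycle
peak_ops_per_second = mac_units * OPS_PER_MAC * CLOCK_HZ

print(mac_units, accumulator_mib, peak_ops_per_second / 1e12)
```

The results agree with the text: 65,536 MACs and exactly 4 MiB of accumulators, and under the stated assumption the peak rate comes to roughly 92 trillion 8-bit operations per second.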

 

Figure 2. TPU block diagram and chip floor plan.

 

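The weights for the matrix unit are staged through an on-chip "weight FIFO" that reads from an off-chip 8GB DRAM that we call "weight memory". For inference, weights are read-only, and 8GB supports many simultaneously active models. The weight FIFO is four tiles deep. Intermediate results are held in the 24 MiB on-chip "unified buffer", which can serve as input to the matrix unit. A programmable DMA controller transfers data between CPU host memory and the unified buffer. To deploy dependably at Google scale, the internal and external memory include built-in error-detection-and-correction hardware.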

 

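The philosophy of the TPU microarchitecture is to keep the matrix unit busy. To that end, the instruction that reads weights follows the decoupled access/execute philosophy: it can complete after sending the weight address but before the weights are fetched from weight memory. The matrix unit stalls if the input activations or the weight data are not ready.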

 

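Because reading a large static random-access memory (SRAM) costs far more energy than arithmetic, the matrix unit uses "systolic execution" to save energy by reducing reads and writes of the unified buffer. Data from different directions arrives at cells in the array at regular intervals, where it is combined. A 65,536-element vector-matrix multiply operation moves through the matrix as a diagonal wavefront. The weights are preloaded and take effect with the advancing wave alongside the first data of a new block. Control and data are pipelined to give the programmer the illusion that the 256 inputs are read at once and instantly update one location in each of 256 accumulators. From a correctness perspective, software is unaware of the systolic nature of the matrix unit, but for performance it must account for the latency of the unit.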
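The diagonal wavefront can be illustrated with a small software model. This is a simplified sketch, not the TPU's actual design: it assumes a weight-stationary array in which cell (i, j) holds one weight, activations move right, partial sums move down, and cell (i, j) fires on cycle i + j. The function name and the tiny 3×2 example are hypothetical.

```python
def systolic_vector_matrix(x, weights):
    """Compute y = x @ weights by scheduling MACs along diagonal wavefronts.

    Returns the output vector and, for each cycle, the list of active cells.
    """
    n_rows, n_cols = len(weights), len(weights[0])
    if len(x) != n_rows:
        raise ValueError("input length must match the number of weight rows")

    partial = [[0] * n_cols for _ in range(n_rows)]  # partial sum leaving each cell
    schedule = []
    for cycle in range(n_rows + n_cols - 1):
        active = []
        for i in range(n_rows):
            j = cycle - i  # cells with i + j == cycle form one diagonal
            if 0 <= j < n_cols:
                from_above = partial[i - 1][j] if i > 0 else 0
                partial[i][j] = from_above + x[i] * weights[i][j]
                active.append((i, j))
        schedule.append(active)
    return partial[n_rows - 1], schedule


y, schedule = systolic_vector_matrix([1, 2, 3], [[1, 2], [3, 4], [5, 6]])
print(y, len(schedule), schedule[1])
```

Each value is read from memory once and then handed from cell to cell, which is where the energy saving comes from; the price is a latency of rows + columns - 1 cycles for a single vector in this model.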

 

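The TPU software stack had to be compatible with the stacks developed for CPUs and GPUs so that applications could be ported to the TPU quickly. The portion of an application that runs on the TPU is typically written in TensorFlow and compiled into an API that can run on GPUs or TPUs.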

 

CPU, GPU, and TPU platforms

 

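Most architecture research papers are based on simulations running small, easily portable benchmarks that predict potential performance if the design is ever implemented. This article is different: it is a retrospective evaluation of machines actually deployed since 2015, some of them used routinely by more than 1 billion people. The six applications listed in Table 1 represent 95% of TPU datacenter use in 2016.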

 

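Because we are measuring production workloads, the benchmark platforms for comparison must also be deployable in Google datacenters, since that is the only place the production workloads run. The many servers in Google datacenters and the dependability requirements of applications at Google scale mean that machines must at least check for memory errors. Because the Nvidia Maxwell GPU and the more recent Pascal P40 GPU do not check for errors in internal memory, it was not feasible to deploy them at Google scale while meeting the strict reliability requirements of Google applications.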

 

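Table 2 lists the servers deployed in Google datacenters that can be compared with the TPU. The traditional CPU server is represented by Intel's 18-core, dual-socket Haswell processor, and that platform is also the host server for the GPUs and TPUs. Google engineers use four TPU chips per server.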

 

Table 2. Benchmarked servers using Haswell CPUs, K80 GPUs, and TPUs.

 

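Many computer architects do not take into account the time lag between a product announcement and the point at which chips, boards, and software are ready to serve customers in datacenters. Table 3 indicates that between 2014 and 2017 the lag for GPUs at commercial cloud companies was 5 to 25 months. The GPU to compare with the TPU that went into service in 2015 is therefore clearly the Nvidia K80, which was built in the same semiconductor process and released six months before the TPU was deployed.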

 

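Table 3. Time lag from release to cloud deployment for Nvidia GPUs from 2015 to 2017; the four GPU generations are the Kepler, Maxwell, Pascal, and Volta architectures.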

 

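Each K80 card contains two dies and provides error detection and correction for internal memory and DRAM. A server of this kind can hold at most eight K80 dies, which is the configuration we benchmarked. Both the CPU and the GPU use large chips, approximately 600 mm2 in area, or roughly three times the size of an Intel Core i7.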

 

Performance: Roofline, response time, and throughput

 

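To illustrate the performance of the six applications on the three types of processor, we use the Roofline performance model from high-performance computing (HPC). This simple visual model is not perfect, but it reveals the causes of performance bottlenecks. The assumption behind the model is that the computation an application needs does not fit entirely in on-chip caches, so the application is either compute-bound or memory-bandwidth-bound. For HPC, the y axis is floating-point performance per second (in FLOPS), so the peak computation rate forms the "flat" part of the roofline. The x axis is operational intensity, measured in FLOPS/byte. Memory bandwidth is bytes per second and forms the "slanted" part of the roofline, since (FLOPS/sec)/(FLOPS/Byte) = Bytes/sec. Without sufficient operational intensity, a program is limited by memory bandwidth and its performance stays under the "slanted" part of the roofline.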
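The Roofline model reduces to taking the minimum of two limits. The sketch below is a minimal illustration with made-up machine numbers (a peak of 100 units and a bandwidth of 4 units); the function names are hypothetical and no figures here come from the article.

```python
def attainable_performance(peak_compute, memory_bandwidth, operational_intensity):
    """Roofline: performance is capped by compute or by bandwidth x intensity."""
    return min(peak_compute, memory_bandwidth * operational_intensity)


def ridge_point(peak_compute, memory_bandwidth):
    """Operational intensity at which the slanted part meets the flat part."""
    return peak_compute / memory_bandwidth


PEAK = 100.0       # hypothetical peak operations per second
BANDWIDTH = 4.0    # hypothetical bytes per second

memory_bound = attainable_performance(PEAK, BANDWIDTH, operational_intensity=10.0)
compute_bound = attainable_performance(PEAK, BANDWIDTH, operational_intensity=50.0)

print(ridge_point(PEAK, BANDWIDTH), memory_bound, compute_bound)
```

An application whose operational intensity lies left of the ridge point sits under the slanted part and is memory-bound; one to the right of it sits under the flat part and is compute-bound.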

 

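The gap between an application's actual operations per second and the ceiling above it shows the potential benefit of further performance tuning while operational intensity is left unchanged; optimizations that raise operational intensity, such as cache blocking, may bring even greater gains.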

 

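To use the Roofline model for the TPU, when the DNN applications are quantized, we first replace floating-point operations with integer operations. Because the weights of DNN applications usually do not fit in on-chip memory, the second change is to redefine operational intensity as the number of integer multiply-accumulate operations per byte of weights read, as shown in Table 1.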

 

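Figure 3 shows the performance of a single TPU, CPU, and GPU die on Rooflines, plotted on logarithmic scales. The TPU Roofline has a long "slanted" part, which in terms of operational intensity means performance is limited more by memory bandwidth than by peak compute. Five of the six applications come close to the ceiling: the MLPs and LSTMs are memory-bound, and the CNNs are compute-bound.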

 

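Figure 3. Performance of the TPU, CPU, and GPU on the Roofline model. Stars are the TPU, triangles are the Nvidia Tesla K80 GPU, and circles are the Intel Haswell; all the TPU stars sit at or above the Rooflines of the other two chips.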

 

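As Figure 3 shows, on the Haswell and the K80 all six DNN applications sit farther below their Roofline ceilings than they do on the TPU. Response time is the reason. Many of these applications are part of end-user-facing services, and research shows that even small increases in response time cause users to use a service less often. Training may have no hard response-time deadline, but inference usually does; in other words, inference prefers low latency over throughput.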

 

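For example, as required by the application developer, MLP0 must respond within 7 ms 99% of the time. (The inferences per second and the 7 ms latency include server host time as well as accelerator time.) Under this limit, the Haswell and the K80 run at only 42% and 37%, respectively, of the highest throughput they could achieve for MLP0 if the response-time limit were relaxed. These bounds affect the TPU too, but at 80% it operates much closer to its highest MLP0 throughput. Unlike CPUs and GPUs, the single-threaded TPU has none of the sophisticated microarchitectural features that spend transistors and energy to improve the average case rather than the 99th-percentile case: it has no caches, branch prediction, out-of-order execution, multiprocessing, speculative prefetching, address coalescing, multithreading, context switching, and so on. Minimalism is a virtue of domain-specific processors.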
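The difference between the average case and the 99th-percentile case is easy to see with numbers. The sketch below uses invented latency samples and a nearest-rank percentile; only the 7 ms limit comes from the article, and the function name is hypothetical.

```python
def percentile_nearest_rank(samples, percent):
    """Smallest sample such that at least `percent`% of samples are <= it."""
    ordered = sorted(samples)
    rank = (percent * len(ordered) + 99) // 100  # integer ceiling of percent*n/100
    return ordered[max(rank, 1) - 1]


LIMIT_MS = 7.0  # the 99th-percentile response-time limit quoted for MLP0

# Hypothetical measurements: 98 fast requests and 2 slow ones.
latencies_ms = [5.0] * 98 + [9.0] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p99_ms = percentile_nearest_rank(latencies_ms, 99)

print(f"mean {mean_ms:.2f} ms, p99 {p99_ms:.1f} ms, meets limit: {p99_ms <= LIMIT_MS}")
```

In this made-up sample the mean is well under 7 ms, yet the 99th percentile violates the limit, which is why features that only help the average case do little for a service held to a tail-latency deadline.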

 

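Table 4 gives the bottom line of relative inference performance per die, including the host server overhead for the two accelerators versus the CPU. It shows the weighted mean of the relative performance of the six DNN applications: the K80 die is 1.9 times as fast as the Haswell die, the TPU die is 29.2 times as fast as the Haswell die, and the TPU die is therefore 15.3 times as fast as the GPU die.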

 

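Table 4. Performance of the K80 GPU die and the TPU die relative to the CPU on the DNN workload. The weighted mean uses the actual mix of the six apps in Table 1.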

 

Cost-performance, total cost of ownership (TCO), and performance/Watt

 

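When buying computers in bulk, cost-performance matters more than raw performance. The best cost metric in a datacenter is total cost of ownership (TCO). The actual price an organization such as Google pays for thousands of chips depends on negotiations between the companies involved. For reasons of business confidentiality, we cannot publish such price information or data. Power, however, is correlated with TCO, and we can publish the Watts per server, so here we use performance/Watt as a proxy for performance/TCO. In this section we compare whole servers rather than single dies.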

 

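Figure 4 shows the mean performance/Watt of the K80 GPU and the TPU relative to the Haswell CPU. We present two different performance/Watt calculations. The first, "total", includes the power consumed by the host CPU server when calculating performance/Watt for the GPU and TPU. The second, "incremental", subtracts the host CPU server power from the GPU and TPU figures.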
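The two calculations differ only in which power figure goes in the denominator. The sketch below illustrates this with entirely hypothetical numbers (none of the performance or power values come from the article), and the function names are made up for the example.

```python
def total_perf_per_watt(performance, server_power_watts):
    """'Total': divide by the power of the whole server, host CPU included."""
    return performance / server_power_watts


def incremental_perf_per_watt(performance, server_power_watts, host_power_watts):
    """'Incremental': divide by the accelerator's share only (server minus host)."""
    return performance / (server_power_watts - host_power_watts)


# Hypothetical CPU-only baseline server.
cpu_perf, cpu_power = 1.0, 500.0
baseline = total_perf_per_watt(cpu_perf, cpu_power)

# Hypothetical accelerator server: same host plus accelerator cards.
accel_perf, accel_server_power, host_power = 30.0, 860.0, 500.0

relative_total = total_perf_per_watt(accel_perf, accel_server_power) / baseline
relative_incremental = (
    incremental_perf_per_watt(accel_perf, accel_server_power, host_power) / baseline
)

print(f"total: {relative_total:.1f}x, incremental: {relative_incremental:.1f}x")
```

Because the incremental figure charges the accelerator only for the power it adds, it is always the larger of the two, which matches the pattern of the ratios reported in the article.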

 

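Figure 4. Relative performance/Watt of the GPU versus the CPU, the TPU versus the CPU, and the TPU versus the GPU, with each comparison drawn in its own color. TPU' is an improved TPU that uses the K80's GDDR5 memory.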

 

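For total performance/Watt, the K80 server is 2.1 times the Haswell. For incremental performance/Watt, where the Haswell server power is omitted, the K80 server is 2.9 times the Haswell. The TPU server has 34 times the total performance/Watt of the Haswell, which makes the TPU server 16 times the performance/Watt of the K80 server. For incremental performance/Watt relative to the CPU, which was Google's reason for building a custom ASIC, the TPU reaches 83 times, and it is 29 times the performance/Watt of the GPU.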

 

Evaluation of alternative TPU designs

 

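Like an FPU, the TPU coprocessor is relatively easy to evaluate, so we created a performance model for the six applications. The results of the model differ from the performance of the actual hardware by less than 10% on average.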

 

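We used the performance model to evaluate a hypothetical TPU die, called TPU', that could have been designed in the same semiconductor technology given another 15 months. More aggressive logic synthesis and block design could raise the clock rate by a further 50%. Designing an interface circuit for GDDR5 memory, as in the K80, would improve the memory bandwidth for reading weights by more than a factor of five, shifting the roofline ridge point, where the slanted part meets the flat part, from 1,350 down to 250.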

 

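Raising the clock rate to 1,050 MHz alone, without improving memory, yields very little gain. If we keep the clock at 700MHz but use GDDR5 for the memory, the weighted mean jumps to 3.9 times. Doing both does not change performance any further, so the hypothetical TPU' only needs faster memory.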

 

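Replacing the DDR3 weight memory with the same GDDR5 memory used in the K80 requires doubling the number of memory channels to four. This improvement would expand the die area by about 10%. Because each server has four TPUs, GDDR5 would also raise the power budget of the TPU system from 861W to about 900W.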

 

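Figure 4 shows that the total performance/Watt per die of TPU' is 86 times that of the Haswell and 41 times that of the K80. The incremental figure is 196 times the Haswell and 68 times the K80.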

 

 

This section follows the fallacy, pitfall, and rebuttal format of Hennessy and Patterson.

 

Fallacy: DNN inference applications in datacenters value throughput as much as response time.

 

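We were surprised that Google's TPU developers had such strict response-time requirements. In 2014, some suggested that batch sizes would be large enough for the TPU to reach peak performance, or that latency requirements would not be so tight. One driving application was offline image processing, and the intuition of Google developers was that if interactive services also wanted TPUs, most of them would simply accumulate large enough batches before handing them to the TPU. Even the developers of one application that cared about response time in 2014 (LSTM1) said the limit was 10ms in 2014, but they shrank it from 10ms to 7ms when actually porting it to the TPU. The unexpected demand for TPUs from many such services, combined with their preference for quick response times, changed the equation: application writers often chose reduced latency over waiting for larger batches to accumulate. Fortunately, the TPU has a simple and repeatable execution model that helps interactive services meet their response-time targets, and such high peak throughput that even relatively small batch sizes yield better performance than contemporary CPUs and GPUs.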

 

Fallacy: The K80 GPU architecture is a good match for DNN inference.

 

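We see five reasons why the TPU beats the K80 GPU in performance, energy, and cost. (1) The TPU has only one processor while the K80 has 13, and a single thread makes it much easier to meet a strict latency target. (2) The TPU has one very large two-dimensional multiply unit, while the GPU has 13 smaller one-dimensional multiply units. The matrix-multiply intensity of DNNs suits arithmetic logic units arranged in a two-dimensional array. (3) A two-dimensional array also enables a systolic implementation on the chip that saves energy by avoiding register accesses. (4) The TPU's quantized applications use 8-bit integers rather than the GPU's 32-bit floating point, and the K80 does not support 8-bit integers. Smaller data improves not only the energy of the computation but also quadruples the effective capacity of the weight FIFO and the effective bandwidth of the weight memory. (Although inference uses 8-bit integers, these applications are trained to deliver the same accuracy as floating point.) (5) The TPU omits features that a GPU requires but DNNs do not use, which shrinks the TPU chip, saves energy, and leaves room for other improvements. The TPU chip is nearly half the size of the K80 and typically runs at one-third of the power, yet it contains 3.5 times as much memory. These five factors explain the TPU's 30 times advantage over the K80 GPU in energy and performance.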

 

Pitfall: Ignoring architecture history when designing a domain-specific architecture.

 

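Ideas that did not work for general-purpose computing may be a good fit for domain-specific architectures. For the TPU, three important architectural features date back to the early 1980s: systolic arrays, decoupled access/execute, and complex instruction sets. The first reduces the area and power of the large matrix multiply unit; the second fetches weights in parallel while the matrix multiply unit is operating; the third makes better use of the limited bandwidth of the PCIe bus for delivering instructions. Domain-specific architects who know the history of computer architecture therefore have a competitive advantage.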

 

Fallacy: If Google used CPUs more efficiently, the CPU results would be comparable to the TPU.

 

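Because using the CPU's Advanced Vector Extensions (AVX2) integer support efficiently takes significant work, we originally had 8-bit integer results for only one DNN on the CPU, and the benefit was about 3.5 times. It was clearer (and took less space) to present all CPU results in floating point, so we did not draw a separate Roofline for that one integer result. If all DNNs achieved a similar speedup, the TPU's performance/Watt advantage would drop from 41 and 83 times to 12 and 24 times, respectively.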

 

Fallacy: If Google used appropriate newer versions, the GPU results would be comparable to the TPU.

 

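Table 3 reports the difference between when a GPU is released and when customers can use it in the cloud. A fair comparison with a newer GPU would include a newer TPU, and for an additional 10W we could triple the performance of the 28 nm, 0.7GHz, 40W TPU just by using the K80's GDDR5 memory. Moving the TPU to a 16 nm process would improve its performance/Watt even further. The 16 nm Nvidia Pascal P40 GPU has half the peak performance of the first-generation TPU, yet at 250 W it uses many times more power. As mentioned earlier, the lack of error checking means Google cannot deploy the P40 in its datacenters, and so cannot run production workloads on it to determine its actual relative performance.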

 

Related work

 

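Two survey articles document that research on DNN ASICs goes back at least to the early 1990s. As covered in Communications of the ACM in 2016, the DianNao family has four DNN architectures that minimize memory accesses both on chip and to external DRAM by providing efficient architectural support for the memory-access patterns found in DNN applications. The original DianNao uses an array of 64 16-bit integer multiply-accumulate units.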

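Google's TPU 3.0 was introduced in May of this year; it is eight times as powerful as TPU 2.0, with performance of up to 100 petaflops, and the chips are liquid-cooled. Domain-specific architectures for DNNs remain a hot topic for computer architects. One focus is architectures for sparse matrices, which appeared after the TPU was first deployed in 2015.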

 

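The Efficient Inference Engine is based on a first pass: a separate initial scan filters out very small values, reducing the number of weights to about 1/10, and Huffman encoding then shrinks the data further to improve inference performance. Cnvlutin skips multiplications when an activation input is zero, which happens up to 44% of the time, in part because the nonlinear function ReLU turns negative values into zero; skipping these computations improves performance by 1.4 times on average.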

 

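Eyeriss is a novel low-power dataflow architecture that exploits zeros through run-length encoding to reduce the memory footprint, and saves energy by avoiding computation when an input is zero. Minerva is a co-design system spanning the algorithm, architecture, and circuit disciplines that cuts power to 1/8 of the original by quantizing data and pruning small activations. Work of this kind published in 2017 includes SCNN, an accelerator for sparse and compressed convolutional neural networks. Both weights and activations are compressed in DRAM and in internal buffers, which reduces the time and energy needed for data transfers and lets the chip store larger models.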

 

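Another trend since 2016 is domain-specific architectures for training. For example, ScaleDeep is an attempt at a high-performance server designed for DNN training and inference that contains thousands of processors. Each chip contains compute-heavy tiles and storage-heavy tiles in a ratio of 3:1, and its performance is 6 to 28 times that of a GPU. It supports 16-bit or 32-bit floating-point computation. The chips are connected through a high-performance interconnection topology that matches DNN communication patterns. Like SCNN, this topology has been evaluated only on CNNs. In 2016, CNNs were only 5% of the TPU workload in Google datacenters. Computer architects look forward to seeing ScaleDeep evaluated on other types of DNN and implemented in hardware.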

 

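DNNs would seem to be a good use case for FPGAs as a datacenter computing platform. One example actually deployed is Catapult. Although Catapult was introduced publicly in 2014, it is a contemporary of the TPU, since Microsoft deployed 28 nm Stratix V FPGAs in 2015, at the same time Google deployed the TPU. Catapult runs CNNs 2.3 times as fast as an ordinary server. Perhaps the most significant difference between Catapult and the TPU is that, for the best performance, users must write long programs for the FPGA in the low-level hardware design language Verilog rather than short programs in the high-level TensorFlow framework. In other words, "re-programmability" comes from software for the TPU rather than from firmware for the fastest FPGA.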

 

Conclusion

 

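The TPU sits on an I/O bus and has relatively limited memory bandwidth, which keeps it from being fully utilized (four of the six DNN applications are memory-bound). Even so, a small fraction of a very large number, such as 65,536 multiply-accumulate operations per cycle, is still a relatively large number, as the roofline performance model shows. This result suggests a valuable corollary to Amdahl's Law: inefficient use of a huge, cheap resource can still deliver high performance cost-effectively.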

 

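We learned that inference applications have strict response-time limits because they are usually user-facing. Designers of chips for DNNs therefore need to ensure that the deadline is met in 99% of cases.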

 

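The TPU die uses its advantage in MACs and on-chip memory to run short programs written in the domain-specific TensorFlow framework 15 times as fast as the K80 GPU die, giving a performance/Watt advantage of 29 times, which is correlated with performance/TCO. Compared with the Haswell CPU die, the corresponding ratios are 29 and 83.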

Five architectural factors may explain this gap in efficiency:

 

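  • One processor. The TPU has only one processor, while the K80 has 13 and the CPU has 18; a single thread makes it easier for the system to stay within a fixed latency limit.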

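  • Large two-dimensional multiply unit. The TPU has one very large two-dimensional multiply unit, while the CPU and GPU have only 18 and 13 smaller one-dimensional multiply units, respectively; two-dimensional hardware performs well on matrix multiplication.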

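  • Systolic array. The two-dimensional organization supports a systolic array, which reduces register accesses and energy consumption.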

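  • 8-bit integers. TPU applications use 8-bit integer rather than 32-bit floating-point arithmetic to improve computation and memory efficiency.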

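  • Dropped features. The TPU drops features that CPUs and GPUs require but DNNs do not use, which makes the TPU cheaper, saves energy, and allows transistors to be repurposed for domain-specific on-chip memory.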

 

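Although future CPUs and GPUs will run inference faster, a TPU redesigned with circa-2015 GPU memory would run two to three times as fast as the original and would raise its performance/Watt advantage to 70 times over the K80 and 200 times over the Haswell.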

 

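For at least the past decade, computer architecture researchers have published innovations based on simulations using limited benchmarks, with improvements for general-purpose processors of 10% or less. We are now reporting gains of more than a factor of 10 from a domain-specific architecture deployed in real hardware running real production applications.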

 

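Order-of-magnitude differences between commercial products are rare in computer architecture, and this may even make the TPU an archetype for future work in the field. We expect that others will follow in this direction and raise the bar even higher.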

  
