Summary of Yuandong Tian: Inside-out interpretability: training dynamics in multi-layer transformer

The video discusses training dynamics in multi-layer transformers, focusing on the attention mechanism. Topics covered include attention sinks and StreamingLLM-style caching for running generation beyond the trained context window, dedicated sink tokens with zero or learnable embeddings, and position interpolation versus extrapolation for context extension.

Researchers or sources featured

Yuandong Tian

Notable Quotes

40:06 — « Finally, the reason for using the attention sink is simply avoiding the limit of the context window. »
49:21 — « Using this sink, in many different scenarios, our StreamingLLM can give you very stable perplexity, which means that the output is very meaningful, and at the same time it saves memory. »
49:50 — « You can actually do even a little bit better: you have dedicated tokens called sink tokens at the beginning, and you keep the sink token either as a zero embedding or a learnable embedding. »
54:22 — « If you do interpolation it behaves quite well, but if you do extrapolation, all of a sudden it gives you super bad scores outside of the window it was trained on; that's actually the reason why we should do interpolation rather than extrapolation. »
63:13 — « That's actually not the right thing to do even if it would make the mathematics much easier; this basically kills the entire physics picture, that's not a good thing to do. »
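
The 40:06 and 49:50 quotes describe the attention-sink idea behind StreamingLLM: keep a few sink tokens at the front of the KV cache plus a sliding window of recent tokens, so generation is not bounded by the trained context window and memory stays flat. Below is a minimal sketch of that cache policy; the function name, tensor shapes, and default sizes are illustrative assumptions, not the speaker's actual implementation.

```python
# Minimal sketch of a StreamingLLM-style KV-cache eviction policy:
# retain the first `n_sink` tokens (attention sinks) plus the most
# recent `window` tokens. Shapes and the helper name are assumptions.
import torch

def evict_kv_cache(keys: torch.Tensor,
                   values: torch.Tensor,
                   n_sink: int = 4,
                   window: int = 1020):
    """keys/values: [batch, heads, seq_len, head_dim]."""
    seq_len = keys.size(2)
    if seq_len <= n_sink + window:
        return keys, values  # nothing to evict yet
    # Keep the attention-sink tokens at the start of the cache ...
    sink_k, sink_v = keys[:, :, :n_sink], values[:, :, :n_sink]
    # ... and the sliding window of most recent tokens.
    recent_k, recent_v = keys[:, :, -window:], values[:, :, -window:]
    return (torch.cat([sink_k, recent_k], dim=2),
            torch.cat([sink_v, recent_v], dim=2))
```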
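
The 54:22 quote contrasts position interpolation with extrapolation for context extension: rather than feeding positions beyond the trained window (extrapolation), positions are rescaled so a longer sequence still maps into the trained range. The sketch below assumes rotary position embeddings (RoPE) and the simple rescale-by-ratio rule of Position Interpolation; the frequencies and lengths are illustrative, not taken from the talk.

```python
# Minimal sketch: RoPE angles under extrapolation vs. interpolation.
import torch

def rope_angles(positions: torch.Tensor, dim: int = 64, base: float = 10000.0):
    # Standard RoPE inverse frequencies; returns angles of shape [seq, dim/2].
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return positions[:, None].float() * inv_freq[None, :]

train_len, new_len = 2048, 8192          # trained window vs. target window
positions = torch.arange(new_len)

# Extrapolation: positions past train_len fall outside the trained range.
angles_extrapolated = rope_angles(positions)

# Interpolation: rescale positions so they all land inside [0, train_len).
angles_interpolated = rope_angles(positions * (train_len / new_len))
```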

Category

Science and Nature
