My NLP/AI learning journey
A personal record of diving deep into AI, from NLP foundations to speech AI and beyond
Why I took this up
The idea behind learning AI was mainly to stay at the cutting edge of technology. The development of LLMs at OpenAI has a fascinating origin: it wasn't planned, but emerged from experiments; the early team took note and decided to go all in, transforming those experiments into a path-breaking paradigm. There's a thrill I feel in being at the cutting edge of something, and that meant going deeper.
This reminds me of a Hacker News post where the original poster complained that coding and software engineering had become boring because LLMs now do most of the coding. One person replied that the whole point of abstraction is to make things boring and easy. We started with compilers, then higher-level programming languages, and now English itself is becoming the programming language. That was the point of technical advancement all along. If you want the "fun" of understanding how things work, you have to go deeper into the stack.
Starting with NLP: Choosing the Right Path
My initial thought was to take a certified NLP course on Coursera. However, when I heard that Andrej Karpathy was launching Eureka Labs, I was both super happy and in dread: there are so many courses already, and another one was about to join the mix. I told myself I'd be damned if I waited for Eureka Labs to come out before starting to study. No more waiting for the perfect course.
I had to choose between a certified course and Andrej's oddly named YouTube series, "Neural Networks: Zero to Hero". I took a bet on the latter, since multiple blogs described the videos as the best starting point. More importantly, I also wanted to learn what it takes to teach well: for a non-university professor to garner such a following purely through teaching, he must be doing something right.
The YouTube Journey
So I started with the YouTube video series. The best part is that he starts with a Jupyter notebook and walks through the code and model building as needed. I had to watch the videos multiple times, 2-3 at minimum, to intuitively grasp each step of the process. Working through the code yourself is essential to get a better sense of everything.
In fact, CS336, Stanford's advanced course on LLMs, was motivated by these YouTube videos. In the final lecture of the series, a 4-hour video, Andrej builds the entire GPT-2. It contains a lot of useful information on GPU parallelisation, quantization during training, and inference optimization. By the end, you're pretty deep into the subject.
Key takeaways from the series:
- Definitely read the papers he suggests in the video comments; they help all the concepts gel together
- After going through the videos, the papers become much easier to understand
- The questions he poses are brilliant; often the solution can be found in the next lecture
- Maximum benefit comes from implementing the attention mechanism yourself
Going Closer to the Metal
I particularly liked that he implements things closer to the metal rather than using libraries. Although this increases the chances of shooting yourself in the foot, it gives a better understanding and visualization of the matrix operations in attention. Having seen the videos, I now prefer this raw approach over using einsum operations.
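To make that concrete, here is a minimal sketch of single-head scaled dot-product attention written two ways: once with explicit matrix products, and once with einsum. This is illustrative NumPy of my own, not code from the videos; the names and shapes are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_matmul(Q, K, V):
    # Explicit matrix operations: (T, d) @ (d, T) -> a (T, T) score matrix,
    # softmaxed row-wise, then applied to the values.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

def attention_einsum(Q, K, V):
    # The same computation expressed in einsum index notation.
    scores = np.einsum("td,sd->ts", Q, K) / np.sqrt(Q.shape[-1])
    return np.einsum("ts,sd->td", softmax(scores), V)

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
assert np.allclose(attention_matmul(Q, K, V), attention_einsum(Q, K, V))
```

The einsum form reads more compactly, but the explicit version makes the (T, T) attention matrix and its row-wise softmax impossible to miss, which is exactly the benefit of going closer to the metal.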
Reading his code is also a great experience: he writes in a very Pythonic and elegant way, including only what's necessary. This helps build intuition on how to write correct code. With LLMs happily generating more code than necessary, it helps to read good code, comprehend its conciseness and precision, and then ask LLMs to replicate the same in your own work.
NanoGPT is a prime example of this philosophy: only 300 lines of code but incredibly powerful, with many forks across the community.
Testing My Understanding: CS224N Assignments
Once I had worked through the code and seen all the videos, I wanted a bigger challenge: taking up assignments from scratch without assistance (university-level work without a TA). So I tried CS224N assignments to test how well I understood the subject material.
On her blog, Chip Huyen describes CS224N as a drawn-out course, and she wasn't wrong. The assignments are long and time-consuming, but you get a real sense of accomplishment after finishing them.
The initial assignments focus on traditional NLP rather than LLMs. While they might not be directly useful right now, they definitely help you understand how we reached this point with LLMs. What were the shortcomings of previous approaches? Remember, until 2022, most people hadn't even heard of LLMs, and work was done using traditional NLP methods.
Note: The solutions to CS224N assignments can be found online now that the class has finished. Feel free to reach out for any queries or brainstorming on solutions. It helps to think through the structure and be clear on pseudocode. With Claude Sonnet as my TA, I was able to ensure my code was correct and optimized.
Moving to Speech AI
Once I had a foothold in NLP, it was time to learn about speech AI. My nationalistic side told me that if I was going to learn AI, it had to benefit India, and that meant voice AI. If every farmer could have a Jarvis with them, what would that look like? That's the dream. Voice AI is undoubtedly the biggest opportunity for India's AI landscape.
I started reading about voice AI but was surprised by the lack of discourse on speech AI courses, as if the speech learning community doesn't want to reveal their secrets. After much searching, I found two good resources: the Hugging Face Audio Course and CS224S.
Learning Path for Speech AI
I started with the Hugging Face Audio Course; it's a gentle introduction to the field. I soon realised I needed to learn signal processing to understand speech AI properly. This book is an excellent primer for signal processing: not overly technical, and if you remember high school math, you'll be fine.
After reading this book, I was able to grasp the nuances of speech ML much better.
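As a taste of the kind of material such a primer covers: speech features are usually computed on the mel scale, a perceptual warping of frequency that is roughly linear below ~1 kHz and logarithmic above. The conversion formulas below are the standard HTK-style ones; the surrounding code is just my own illustration.

```python
import math

def hz_to_mel(f_hz: float) -> float:
    # Standard (HTK-style) mel-scale formula: compresses high frequencies
    # to better match how pitch differences are actually perceived.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    # Inverse mapping, used when placing mel filterbank edges back in Hz.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Equal steps on the mel axis correspond to growing steps in Hz,
# which is why mel filterbanks get wider at higher frequencies.
for f_hz in (200.0, 1000.0, 4000.0):
    print(f_hz, hz_to_mel(f_hz))
```

This is the warping behind the mel spectrograms and MFCC features that keep showing up throughout speech ML.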
As part of CS224S (Spring 2025), I learned about:
- Phonetics
- Evolution from ASR to TTS to modern speech models
- The shift from concatenative approaches to neural network approaches
- The final project involved multilingual ASR
Project Results: The goal was to reduce Word Error Rate (WER) on low-resource languages below a certain threshold. The assignment asked us to bring it below 30%. With my enhancements using data augmentation techniques and combining training data from multiple languages, I managed to reduce it from 55% to 36%. There are still other approaches I plan to implement to bring it even lower.
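For context, WER is just the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. Here is a minimal dynamic-programming sketch of the metric itself, not the actual CS224S scoring script:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # all deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # match / substitution
    return dp[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat mat"))  # 2 deletions / 6 words
```

So reducing WER from 55% to 36% means that roughly one word in three still involves an error relative to the reference transcript, down from more than half.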
I even wrote to Professor Andrew Maas asking if we could get video recordings for the course, because slides can be difficult to follow on their own. Later, I learned they wouldn't be releasing videos. Nonetheless, the assignments and slides were incredibly helpful.
I noticed many Discord servers throwing around information with multiple links; for someone planning to learn in a structured manner, this becomes difficult to track and sometimes demotivating. Hence, I chose the structured course approach to dip my toes into speech AI.
Next Steps and Future Learning
Geospatial AI
I've always been fascinated with geospatial data. While between jobs around 9 months ago, I was reading up on geospatial and remote sensing topics. I realised this would be fun, so I dabbled a bit there. I'll need to build more knowledge in computer vision; this is currently in progress.
Reinforcement Learning
I haven't fully dived into RL yet. I started watching some of David Silver's lectures and was blown away; he explains things very well. But is this something I'd love to pursue? I'm not sure yet.
Random Reflections
On AI's Pace: AI seems to be moving at an astonishing pace. I was recently looking up "nano banana" from Google, and the speed of development is, frankly, bananas. How does one keep up with everything happening? The frontier of new products that can be created keeps expanding exponentially.
On Context Windows: Doing so much LLM reading has made me think in terms of context windows. I think it's surprisingly similar to how the human brain works. When you feel calm after just waking up in the morning, you can imagine your context window being absolutely clean: that's the peace of seeing a blank Claude chat. As you start taking in more information, your context window starts to get cluttered.
On AI hype and FOMO: With so much noise around AI and every influencer starting their AI company, it can feel discouraging or trigger FOMO about the AI hype train. What's the point of doing all this deep study?
At moments like these, it's good to remember why I startedâto stay at the cutting edge, not just be happy at the application layer, but to truly participate in groundbreaking research. This was around the time of DeepSeek's release, and a bit of nationalistic pride helped. I felt that if India has to compete, we need to do groundbreaking research. I must be plugged into that research ecosystem and build enough skills to understand the research. First comes knowing the domain, then asking the right questions.
This journey continues to evolve. If you're on a similar path or have questions about any of these resources, feel free to reach out. Learning is always better when shared.