My NLP/AI learning journey
A personal record of diving deep into AI, from NLP foundations to speech AI and beyond
Why I took this up
The idea behind learning AI was mainly to stay at the cutting edge of technology. The development of LLMs at OpenAI has a fascinating origin: it wasn't planned, but emerged from experiments; the early team took note and decided to go all in, transforming those experiments into a path-breaking paradigm. There's a thrill I feel in being at the cutting edge of something, and that meant going deeper.
This reminds me of a Hacker News post where the original poster complained that coding and software engineering had become boring because LLMs now do most of the coding. One person replied that the whole point of abstraction is to make things boring and easy. We started with compilers, then higher-level programming languages, and now English itself is becoming the programming language. That was the point of technical advancement all along. If you want the "fun" of understanding how things work, you have to go deeper into the stack.
Starting with NLP: Choosing the Right Path
My initial thought was to take a certified NLP course on Coursera. However, when I heard that Andrej Karpathy was launching Eureka Labs, I was both super happy and in dread: there are so many courses already, and another one was about to join the mix. I told myself I'd be damned if I waited for Eureka Labs to come out before starting to study. No more waiting for the perfect course.
I had to choose between a certified course and Andrej's oddly named YouTube series, "Neural Networks: Zero to Hero". I took a bet on the latter, since multiple blogs described the videos as the best starting point. More importantly, I also wanted to learn what it takes to teach well: for a non-university professor to garner such a following purely through teaching, he must be doing something right.
The YouTube Journey
So I started with the YouTube video series. The best part is that he starts with a Jupyter notebook and walks through the code and model building as needed. I had to watch the videos multiple times, 2-3 at minimum, to intuitively grasp each step of the process. Working through the code yourself is essential to get a better sense of everything.
In fact, CS336, Stanford's advanced course on LLMs, was motivated by these YouTube videos. In the final lecture of the series, a 4-hour video, Andrej builds the entire GPT-2. It contains a lot of useful information on GPU parallelisation, quantization during training, and inference optimization. By the end, you're pretty deep into the subject.
Key takeaways from the series:
- Definitely read the papers he suggests in the video comments; they help all the concepts gel together
- After going through the videos, the papers become much easier to understand
- The questions he poses are brilliant; often the solution can be found in the next lecture
- Maximum benefit comes from implementing the attention mechanism yourself
Going Closer to the Metal
I particularly liked that he implements things closer to the metal rather than using libraries. Although this increases the chances of shooting yourself in the foot, it gives a better understanding and visualization of the matrix operations in attention. Having seen the videos, I now prefer this raw approach over using einsum operations.
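To make that concrete, here is a minimal sketch of single-head scaled dot-product attention written two ways: once with explicit matrix products, and once with einsum. This is illustrative NumPy of my own, not code from the videos; the names and shapes are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_matmul(Q, K, V):
    # Explicit matrix operations: (T, d) @ (d, T) -> a (T, T) score matrix,
    # softmaxed row-wise, then applied to the values.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

def attention_einsum(Q, K, V):
    # The same computation expressed in einsum index notation.
    scores = np.einsum("td,sd->ts", Q, K) / np.sqrt(Q.shape[-1])
    return np.einsum("ts,sd->td", softmax(scores), V)

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
assert np.allclose(attention_matmul(Q, K, V), attention_einsum(Q, K, V))
```

The einsum form reads more compactly, but the explicit version makes the (T, T) attention matrix and its row-wise softmax impossible to miss, which is exactly the benefit of going closer to the metal.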
Reading his code is also a great experience: he writes in a very Pythonic and elegant way, including only what's necessary. This helps build intuition on how to write correct code. With LLMs happily generating more code than necessary, it helps to read good code, comprehend its conciseness and precision, and then ask LLMs to replicate the same in your own work.
NanoGPT is a prime example of this philosophy: only 300 lines of code but incredibly powerful, with many forks across the community.
Testing My Understanding: CS224N Assignments
Once I had worked through the code and seen all the videos, I wanted a bigger challenge: taking up assignments from scratch without assistance (university-level work without a TA). So I tried CS224N assignments to test how well I understood the subject material.
On her blog, Chip Huyen describes CS224N as a drawn-out course, and she wasn't wrong. The assignments are long and time-consuming, but you get a real sense of accomplishment after finishing them.
The initial assignments focus on traditional NLP rather than LLMs. While they might not be directly useful right now, they definitely help you understand how we reached this point with LLMs. What were the shortcomings of previous approaches? Remember, until 2022, most people hadn't even heard of LLMs, and work was done using traditional NLP methods.
Note: The solutions to CS224N assignments can be found online now that the class has finished. Feel free to reach out for any queries or brainstorming on solutions. It helps to think through the structure and be clear on pseudocode. With Claude Sonnet as my TA, I was able to ensure my code was correct and optimized.
Moving to Speech AI
Once I had a foothold in NLP, it was time to learn about speech AI. My nationalistic side told me that if I was going to learn AI, it had to benefit India, and that meant voice AI. If every farmer could have a Jarvis with them, what would that look like? That's the dream. Voice AI is undoubtedly the biggest opportunity for India's AI landscape.
I started reading about voice AI but was surprised by the lack of discourse on speech AI courses, as if the speech learning community doesn't want to reveal their secrets. After much searching, I found two good resources: the Hugging Face Audio Course and CS224S.
Learning Path for Speech AI
I started with the Hugging Face Audio Course; it's a gentle introduction to the field. I soon realised I needed to learn signal processing to understand speech AI properly. This book is an excellent primer for signal processing: not overly technical, and if you remember high school math, you'll be fine.
After reading this book, I was able to grasp the nuances of speech ML much better.
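As a taste of the kind of material such a primer covers: speech features are usually computed on the mel scale, a perceptual warping of frequency that is roughly linear below ~1 kHz and logarithmic above. The conversion formulas below are the standard HTK-style ones; the surrounding code is just my own illustration.

```python
import math

def hz_to_mel(f_hz: float) -> float:
    # Standard (HTK-style) mel-scale formula: compresses high frequencies
    # to better match how pitch differences are actually perceived.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    # Inverse mapping, used when placing mel filterbank edges back in Hz.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Equal steps on the mel axis correspond to growing steps in Hz,
# which is why mel filterbanks get wider at higher frequencies.
for f_hz in (200.0, 1000.0, 4000.0):
    print(f_hz, hz_to_mel(f_hz))
```

This is the warping behind the mel spectrograms and MFCC features that keep showing up throughout speech ML.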
As part of CS224S (Spring 2025), I learned about:
- Phonetics
- Evolution from ASR to TTS to modern speech models
- The shift from concatenative approaches to neural network approaches
- The final project involved multilingual ASR
Project Results: The goal was to reduce Word Error Rate (WER) on low-resource languages below a certain threshold. The assignment asked us to bring it below 30%. With my enhancements using data augmentation techniques and combining training data from multiple languages, I managed to reduce it from 55% to 36%. There are still other approaches I plan to implement to bring it even lower.
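For context, WER is just the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. Here is a minimal dynamic-programming sketch of the metric itself, not the actual CS224S scoring script:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # all deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # match / substitution
    return dp[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat mat"))  # 2 deletions / 6 words
```

So reducing WER from 55% to 36% means that roughly one word in three still involves an error relative to the reference transcript, down from more than half.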
I even wrote to Professor Andrew Maas asking if we could get video recordings for the course, because slides can be difficult to follow on their own. Later, I learned they wouldn't be releasing videos. Nonetheless, the assignments and slides were incredibly helpful.
I noticed many Discord servers throwing around information with multiple links; for someone planning to learn in a structured manner, this becomes difficult to track and sometimes demotivating. Hence, I chose the structured course approach to dip my toes into speech AI.
Next Steps and Future Learning
Geospatial AI
I've always been fascinated with geospatial data. While between jobs around 9 months ago, I was reading up on geospatial and remote sensing topics. I realised this would be fun, so I dabbled a bit there. I'll need to build more knowledge in computer vision; this is currently in progress.
Reinforcement Learning
I haven't fully dived into RL yet. I started watching some of David Silver's lectures and was blown away; he explains things very well. But is this something I'd love to pursue? I'm not sure yet.
Random Reflections
On AI's Pace: AI seems to be moving at an astonishing pace. I was recently looking up "nano banana" from Google, and the speed of development is, frankly, bananas. How does one keep up with everything happening? The frontier of new products that can be created keeps expanding exponentially.
On Context Windows: Doing so much LLM reading has made me think in terms of context windows. I think it's surprisingly similar to how the human brain works. When you feel calm after just waking up in the morning, you can imagine your context window being absolutely clean: that's the peace of seeing a blank Claude chat. As you start taking in more information, your context window starts to get cluttered.
On AI hype and FOMO: With so much noise around AI and every influencer starting their AI company, it can feel discouraging or trigger FOMO about the AI hype train. What's the point of doing all this deep study?
At moments like these, it's good to remember why I startedâto stay at the cutting edge, not just be happy at the application layer, but to truly participate in groundbreaking research. This was around the time of DeepSeek's release, and a bit of nationalistic pride helped. I felt that if India has to compete, we need to do groundbreaking research. I must be plugged into that research ecosystem and build enough skills to understand the research. First comes knowing the domain, then asking the right questions.
This journey continues to evolve. If you're on a similar path or have questions about any of these resources, feel free to reach out. Learning is always better when shared.