Citation City

About

If there's an itch, I will scratch it. I've been listening to a couple of podcasts of late, and they reference a lot of entities that I think get lost: books, films, actors, TV shows, podcasts, and artists across the whole spectrum. I was looking for a way to generate a list that I can reference. As much as I like stopping the pod, watching a trailer, and starting back up again, the flow of the pod gets lost. I want to get to the end and then follow the trail of breadcrumbs.

Time to talk about Named Entity Recognition (NER).

Watch as your eyes glaze over. Named entities are a fundamental concept in Natural Language Processing (NLP) and Information Extraction (IE). They are the specific, nameable things mentioned in a text, such as people, books, or films, and NER is the task of finding those mentions and classifying them into predefined categories.
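To make that less abstract, here is a toy sketch of what "identified and classified" looks like as data. It is a plain dictionary lookup (a gazetteer), not an ML model and not what this project runs. The `GAZETTEER` entries, the category labels, and the `Entity` and `find_entities` names are all made up for illustration.

```python
from dataclasses import dataclass

# Hypothetical gazetteer: hand-written names mapped to predefined categories.
GAZETTEER = {
    "Blade Runner": "FILM",
    "Philip K. Dick": "PERSON",
    "Radiolab": "PODCAST",
}


@dataclass(frozen=True)
class Entity:
    text: str      # the name exactly as it appears in the text
    category: str  # one of the predefined categories
    start: int     # character offset where the name begins


def find_entities(text: str) -> list[Entity]:
    """Return every gazetteer name found in text, ordered by position."""
    found = []
    for name, category in GAZETTEER.items():
        start = text.find(name)
        while start != -1:
            found.append(Entity(name, category, start))
            # Keep searching after this match to catch repeat mentions.
            start = text.find(name, start + len(name))
    return sorted(found, key=lambda e: e.start)


sentence = "Blade Runner is based on a novel by Philip K. Dick."
for entity in find_entities(sentence):
    print(entity.category, entity.text)
```

A lookup like this only finds names it already knows, which is exactly why a real NER model is worth the trouble: it has to recognise names nobody listed in advance.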

Still with me?

Methodology

Build a pipeline to scrape the podcast transcripts, run them through an ML model to extract the named entities, and link them to their relevant sources.

Nope. Fun fact: due to the data wars going on right now, I couldn't find a way to get a podcast transcript. Believe me: I've tried.

Also: running my own transcription service is a learning device. Don't be lazy: learn.
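With scraping off the table, the shape of the pipeline becomes transcribe, extract, link. The sketch below shows only that shape, assuming each stage can be passed in as a plain function. `run_pipeline`, `search_link`, and the fake stages are hypothetical names, and linking via a generic web-search URL is an assumption, not necessarily how this project resolves sources.

```python
from typing import Callable
from urllib.parse import quote_plus

# An entity here is just (text, category). This shape is an assumption.
Entity = tuple[str, str]


def search_link(entity: Entity) -> str:
    """Build a generic web-search URL for an entity (assumed linking strategy)."""
    text, category = entity
    return "https://duckduckgo.com/?q=" + quote_plus(f"{text} {category.lower()}")


def run_pipeline(
    audio_path: str,
    transcribe: Callable[[str], str],
    extract: Callable[[str], list[Entity]],
) -> list[dict[str, str]]:
    """Transcribe one episode, extract its entities, and attach a link to each."""
    transcript = transcribe(audio_path)
    seen: set[Entity] = set()
    rows = []
    for entity in extract(transcript):
        if entity in seen:  # list each entity once per episode
            continue
        seen.add(entity)
        rows.append(
            {"text": entity[0], "category": entity[1], "link": search_link(entity)}
        )
    return rows


# Stand-in stages so the sketch runs without audio files or a model.
def fake_transcribe(path: str) -> str:
    return "We talked about Dune, and then about Dune again."


def fake_extract(transcript: str) -> list[Entity]:
    return [("Dune", "BOOK"), ("Dune", "BOOK")]


for row in run_pipeline("episode-01.mp3", fake_transcribe, fake_extract):
    print(row["category"], row["text"], row["link"])
```

Passing the stages in as functions means the transcription and NER services can be swapped or faked independently, which is handy while one of them is still a rabbit hole.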

Transcription Service

I say rabbit hole a lot. However, nailing the transcription service was the toughest part of the project. If only I had a GPU under my desk.

NER Service

The NER service uses Google's Gemini 1.5 Pro to extract not just named entities, but behavioral patterns too. What do the hosts debate most? What deep dives do they take? What childhood memories surface? What funny moments emerge? The model identifies these patterns across episodes.
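A rough sketch of the two pieces around the model call: building a prompt that asks for entities and the four pattern types, and parsing the reply. The call to Gemini itself is deliberately left out. The prompt wording, the JSON schema, and the `build_prompt` and `parse_reply` names are assumptions for illustration, not the project's actual prompt or format.

```python
import json

FENCE = "`" * 3  # Markdown code fence that models often wrap JSON in

# Assumed prompt and output schema; the real ones are not documented here.
PROMPT_TEMPLATE = """You are analysing a podcast transcript.
Return only JSON with two keys:
  "entities": a list of {{"text": ..., "category": ...}} objects
  "patterns": an object with the keys "debates", "deep_dives",
              "childhood_memories" and "funny_moments", each a list of strings

Transcript:
{transcript}
"""


def build_prompt(transcript: str) -> str:
    """Fill the transcript into the prompt template."""
    return PROMPT_TEMPLATE.format(transcript=transcript)


def parse_reply(reply: str) -> dict:
    """Parse the model's reply, tolerating a code fence around the JSON."""
    body = reply.strip()
    if body.startswith(FENCE):
        # Drop the opening fence line (and its optional language tag).
        body = body.split("\n", 1)[1] if "\n" in body else ""
        # Drop the closing fence, if there is one.
        body = body.rsplit(FENCE, 1)[0]
    data = json.loads(body)
    # Fill in missing keys so callers can rely on both being present.
    data.setdefault("entities", [])
    data.setdefault("patterns", {})
    return data


# Example reply, written by hand to stand in for real model output.
reply = FENCE + 'json\n{"entities": [{"text": "Dune", "category": "BOOK"}]}\n' + FENCE
parsed = parse_reply(reply)
print(parsed["entities"], parsed["patterns"])
```

Asking for JSON and parsing defensively keeps the rest of the pipeline independent of how chatty the model decides to be on a given day.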