AI App development journal
Apps
Over the past month I've started using AI to build apps in a day that would have taken 5-10X longer before. The goal of this post is to capture my learnings. I'll summarize each app at the top, share an overview of the development process, and then keep a running log at the bottom.
- Goaly - daily progress tracker
- PDF OCD - PDF OCR job for personal documents
- Paper 2 Pod - turn academic papers into compelling podcasts
- Investor call viewer - video + transcript viewer for investor calls
Goaly
Like many people, I struggle to continue making progress on a goal. I always start with the best of intentions, but a couple of weeks later the motivation has slowly faded. There are lots of progress trackers out there, but they've all felt like overkill.
With Goaly I wanted to explore an audio input, where logging my progress was as simple as opening the app and pressing record. Behind the scenes, I'd transcribe the recording, summarize it with AI, and log it to the appropriate goal.
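A rough sketch of that pipeline in TypeScript with the OpenAI SDK might look like this; the prompt, model choices, and `appendToGoal` helper are illustrative, not Goaly's actual code:

```ts
import fs from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Append one summarized update to a local log file (stand-in for real storage).
function appendToGoal(goal: string, summary: string) {
  const entry = { goal, summary, at: new Date().toISOString() };
  fs.appendFileSync("goal-log.jsonl", JSON.stringify(entry) + "\n");
}

export async function logProgress(audioPath: string, goals: string[]) {
  // 1. Transcribe the voice memo.
  const transcription = await openai.audio.transcriptions.create({
    file: fs.createReadStream(audioPath),
    model: "whisper-1",
  });

  // 2. Summarize it and route it to the matching goal.
  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    response_format: { type: "json_object" },
    messages: [
      {
        role: "system",
        content:
          "Summarize the user's progress update in one sentence and pick the " +
          `best match from these goals: ${goals.join(", ")}. ` +
          'Reply as JSON: {"goal": "...", "summary": "..."}',
      },
      { role: "user", content: transcription.text },
    ],
  });

  const { goal, summary } = JSON.parse(completion.choices[0].message.content!);
  appendToGoal(goal, summary);
}
```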
PDF OCD
My wife and I scan all of the documents that we receive so that they're easier to share and find later. This might be a bit OCD (hence the name), but we've found it's a simple hack for putting PDFs in our todo list. The problem is that when you want to find a document from a year or two ago, most of the documents are not OCR'd and the filenames are terrible. PDF OCD runs hourly in the background, and when it finds a new PDF it OCRs the file and uses AI to generate a new filename with the date, company, and description of the doc.
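A minimal sketch of that hourly job, assuming the open-source `ocrmypdf` CLI for the OCR step; the folder paths and the `suggestFilename` helper are placeholders, not the real code:

```ts
import { execFileSync } from "node:child_process";
import fs from "node:fs";
import path from "node:path";

const INBOX = "/scans/inbox";        // hypothetical folders
const PROCESSED = "/scans/processed";

// Placeholder: the real job extracts the OCR'd text and asks a model for a
// "YYYY-MM-DD company description.pdf" name. Here we just date-stamp it.
async function suggestFilename(pdfPath: string): Promise<string> {
  const date = new Date().toISOString().slice(0, 10);
  return `${date} ${path.basename(pdfPath)}`;
}

export async function runOnce() {
  for (const file of fs.readdirSync(INBOX)) {
    if (!file.toLowerCase().endsWith(".pdf")) continue;

    const src = path.join(INBOX, file);
    const dst = path.join(PROCESSED, file);

    // Add a searchable text layer; --skip-text leaves pages that already
    // have text alone instead of erroring.
    execFileSync("ocrmypdf", ["--skip-text", src, dst]);

    fs.renameSync(dst, path.join(PROCESSED, await suggestFilename(dst)));
  }
}
```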
Paper 2 Pod
Recently, I've been using ChatGPT to learn more about some technical topics like the economic impact of climate change and the design of Gen 4 fission reactors. These topics benefit from reading the original source work, but the content is pretty dry, and who wants to sit in front of a screen all day!
With Paper 2 Pod, the kernel of the idea was: how cool would it be if we could turn an academic paper, a collection of articles, or even a book into a compelling, conversational, long-form podcast? The project breaks down into four segments (sketched as a pipeline below):
- researching the papers and building a curriculum of content to cover
- turning the content into a compelling script, ideally with persistent characters and guests
- turning the script into compelling audio
- letting the listener join the conversation at any point (which would be dope)
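Here's one way those four stages could be typed out; every name and shape here is a hypothetical sketch, not the project's real code:

```ts
// The four Paper 2 Pod stages as typed pipeline steps.
interface Curriculum {
  title: string;
  sources: string[]; // papers, articles, chapters
  outline: string[]; // topics to cover, in order
}

interface ScriptTurn {
  speaker: string; // persistent host/guest characters
  text: string;
}

type Script = ScriptTurn[];

interface Episode {
  script: Script;
  audio: ArrayBuffer; // synthesized speech, stitched per turn
}

// Stage 1: research the sources and build a curriculum.
type Research = (sources: string[]) => Promise<Curriculum>;
// Stage 2: turn the curriculum into a conversational script.
type WriteScript = (c: Curriculum) => Promise<Script>;
// Stage 3: turn the script into audio.
type Synthesize = (s: Script) => Promise<Episode>;
// Stage 4: let the listener interject; regenerate from that point on.
type Interject = (e: Episode, question: string, atSecond: number) => Promise<Episode>;
```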
Investor call viewer
I have a tendency to rabbit hole, which is a nice way of saying obsess. Recently, I've let myself run wild and learn everything I could about Oklo, an advanced nuclear reactor startup, which has included watching investor calls.
As part of the project, I've found it frustrating not to have a transcript that I can search and reference afterwards. With Investor call viewer, I can now watch the investor calls and read the transcript at the same time, and because it's interactive, I can always search for a topic in the transcript and jump to that portion of the video.
Like the earlier projects, this was another opportunity to revisit generating transcripts, but this time with an emphasis on readability and on keeping the transcript in sync with the original audio timestamps.
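The core of that search-and-jump interaction is just timestamped segments plus the standard `currentTime` API on the video element. A sketch, assuming a `Segment` shape that isn't the viewer's actual data model:

```ts
interface Segment {
  start: number; // seconds into the call
  end: number;
  text: string;
}

// Find every transcript segment that mentions the query.
function findSegments(transcript: Segment[], query: string): Segment[] {
  const q = query.toLowerCase();
  return transcript.filter((s) => s.text.toLowerCase().includes(q));
}

// Jump the player to a segment (standard DOM API: seek, then play).
function jumpTo(video: HTMLVideoElement, segment: Segment) {
  video.currentTime = segment.start;
  video.play();
}

// The inverse keeps the transcript in sync while the video plays:
// highlight whichever segment contains the current playback time.
function activeSegment(transcript: Segment[], t: number): Segment | undefined {
  return transcript.find((s) => t >= s.start && t < s.end);
}
```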
Log
December 9th - Theme button
December 7th - reader mode
I asked Composer to create a reader mode button on my blog. There were three challenges. The first was learning how to hide the side panels. The second was learning how to position the button with the `reshaped` library. The third was learning how to make the layout interactive despite it being a server component. It took one nudge to get the layout to hide the side panels with the right API. It took a couple of nudges to learn how to use the `reshaped` library, but giving it access to the docs and running `tsc` fixed the issues. It took one nudge to get the interaction to work with RSC, which was super impressive.
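For context, the usual pattern for that last challenge is to keep the layout as a server component and isolate the interactivity in a small client component. A hypothetical sketch (not my blog's actual code; the class name and styling are made up):

```tsx
"use client";

import { useState } from "react";

// Rendered from the server-component layout as <ReaderModeButton />.
export function ReaderModeButton() {
  const [on, setOn] = useState(false);

  function toggle() {
    // Flip a class on <body>; CSS hides the side panels when it's present.
    document.body.classList.toggle("reader-mode", !on);
    setOn(!on);
  }

  return <button onClick={toggle}>{on ? "Exit reader mode" : "Reader mode"}</button>;
}
```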
December 5th - Consolidating transcripts
The transcripts app has logic for consolidating the timestamped segments so that the transcript is easier to read. There are a couple of rules here. First, it's good to aim for segments that are 30 to 60 seconds long. Second, it's nice to start a new segment when the speaker mentions a slide, like "going to a slide", since that's a logical place to start reading a new thought.
This logic has a surprising number of edge cases given that the transcript needs to be parsed, and as a result the agent kept failing. With agent mode enabled, I asked it to look at its own output; it was able to conclude that it had failed, but it was not able to fix it. I then asked it to write unit tests to exercise the edge cases and add fine-grained logging to inspect the values. This approach was good, but even with agent mode, which could run the tests, look at the terminal output, and then rewrite the code, it kept failing.
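To make the rules above concrete, here's a minimal sketch of what that consolidation pass might look like; the `Segment` shape and the slide-cue regex are my assumptions, not the app's actual code:

```ts
interface Segment {
  start: number; // seconds
  end: number;
  text: string;
}

// Cue that the speaker is moving to a new slide, a natural break point.
const SLIDE_CUE = /\b(?:going|moving|turning) to (?:the next )?slide\b/i;

export function consolidate(segments: Segment[], min = 30, max = 60): Segment[] {
  const out: Segment[] = [];
  let current: Segment | null = null;

  for (const seg of segments) {
    const wouldExceedMax = current !== null && seg.end - current.start > max;

    // Cut a new consolidated segment on a slide cue, or when appending
    // would grow the current one past the max length.
    if (current === null || SLIDE_CUE.test(seg.text) || wouldExceedMax) {
      if (current) out.push(current);
      current = { ...seg };
    } else {
      current.end = seg.end;
      current.text += " " + seg.text;
    }
  }
  if (current) out.push(current);

  // Merge a too-short trailing segment back into its predecessor — exactly
  // the kind of edge case the agent kept tripping over.
  if (out.length > 1) {
    const last = out[out.length - 1];
    if (last.end - last.start < min) {
      const prev = out[out.length - 2];
      prev.end = last.end;
      prev.text += " " + last.text;
      out.pop();
    }
  }
  return out;
}
```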