From Idea to Audiobook with Claude.AI



A Forensic Account · April 2026

From Idea to Audiobook

A real working session with AI that hit twelve errors, navigated SSL failures and dependency hell, and delivered a working pipeline, built locally, on a writer’s own desk.



Most AI productivity articles show you curated demos: a clean prompt, a perfect response, a tidy outcome. This is not that article.

What follows is a forensic account of a real working session between a human (a writer wanting to turn a manuscript into an audiobook using their own voice) and Claude, Anthropic’s AI assistant. The session ran long, hit real errors, navigated SSL certificate failures and dependency conflicts, and confronted a fundamental question about preserving a Nigerian accent in AI voice synthesis.

The measure of AI productivity is not how fast it answers easy questions. It is how useful it remains when things go wrong.

By the end of the session, the human had a working Python pipeline generating audio from their book manuscript using a cloned version of their own voice, running locally on their Windows machine with GPU acceleration, no cloud subscription, no monthly fee, no vendor lock-in.

This article breaks down how that happened, what went wrong, how problems were diagnosed and resolved, and what this style of human-AI collaboration means for knowledge workers tackling technical problems outside their core expertise.



Part One

Mapping the Solution Space

The session opened with a simple, open-ended question:

The Human

“I want to create an audio book of a book I wrote using my own voice. How can I use AI for this?”

This is exactly the kind of question that benefits from AI assistance: it spans multiple technical domains, has no single correct answer, and requires navigating tradeoffs the questioner may not even know exist yet.

The Response Strategy: Options, Not Prescriptions

Rather than immediately recommending a single tool, the AI mapped the solution space into three distinct approaches:

Option One

Voice Cloning with AI TTS

Record a short voice sample, upload to a service like ElevenLabs, generate the full audiobook in your cloned voice.

Option Two

Record Yourself, Use AI for Cleanup

Your real voice, cleaned up with tools like Adobe Podcast or Descript.

Option Three

The Hybrid

Record key passages yourself, clone the rest.

This structure was deliberate. Rather than assuming the human wanted to avoid recording at all, or that they had a microphone, or that they were comfortable with cloud services, the AI presented distinct strategies that traded off differently on cost, effort, quality, and privacy.

💡
Productivity Insight

Starting with a structured options map rather than jumping to a solution saves significant rework time. The human can self-select based on constraints the AI does not yet know about.

The Follow-Up That Changed Everything

The human’s follow-up revealed a crucial constraint that reshaped the entire direction:

The Human

“For option 1, is there a way I can do this on my own desk without using those services? I can code.”

Two critical pieces of information in one sentence: a preference for local execution (privacy, cost, control) and a technical capability (coding). This unlocked a completely different solution tier: open-source, GPU-accelerated, fully local voice cloning, which would have been inappropriate to recommend without knowing the human could implement it.

The best AI interactions are dialogues, not monologues. Each exchange narrows the solution space until a specific, actionable path emerges.



Part Two

From Concept to Working Code

Once the human confirmed they had a GPU available, the session moved from exploration to execution. Claude produced two deliverables in a single response: a complete Python pipeline (audiobook_pipeline.py) handling the full workflow from text file to finished audiobook, and an interactive browser-based configuration dashboard (audiobook_ui.html) for building the command without memorising flags.

The pipeline included several non-obvious engineering decisions worth noting:

1. Smart chapter detection using regex to recognise common heading patterns (Chapter 1, Part Two, Prologue) and split the book accordingly.
2. Sentence-aware chunking respecting XTTS v2’s ~250 character input limit, breaking at punctuation boundaries rather than mid-sentence (these first two steps are sketched after the engineering note below).
3. Automatic stitching of chunks into chapters, and chapters into a full book, with configurable silence gaps between sections.
4. GPU auto-detection with graceful CPU fallback.
5. Progress bars per chapter using tqdm for live feedback during long generations.

⚙️
Engineering Note

The 250-character chunking limit is a hard constraint of XTTS v2 that catches many first-time users. Building it into the pipeline from the start prevents hours of debugging later.
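As a rough illustration of the first two decisions, here is a minimal sketch of chapter detection and sentence-aware chunking. The heading regex, function names, and packing logic are illustrative rather than lifted from the session’s actual audiobook_pipeline.py.

```python
import re

# Illustrative heading patterns: "Chapter 1", "Part Two", "Prologue", "Epilogue".
HEADING_RE = re.compile(
    r"^\s*(chapter\s+\w+|part\s+\w+|prologue|epilogue)\b.*$",
    re.IGNORECASE | re.MULTILINE,
)

def detect_chapters(text: str) -> list[tuple[str, str]]:
    """Split a manuscript into (title, body) pairs at recognised headings."""
    matches = list(HEADING_RE.finditer(text))
    if not matches:
        return [("Full Book", text)]
    chapters = []
    # Keep anything before the first heading as a Front Matter section
    # (the behaviour that later resolves error 10 in Part Three).
    front = text[: matches[0].start()].strip()
    if front:
        chapters.append(("Front Matter", front))
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chapters.append((match.group(0).strip(), text[match.end():end].strip()))
    return chapters

def split_into_chunks(text: str, limit: int = 250) -> list[str]:
    """Pack whole sentences into chunks under XTTS v2's ~250 character input limit."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sentence  # an over-long single sentence still becomes its own chunk
    if current:
        chunks.append(current)
    return chunks
```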

The Windows Problem Nobody Warns You About

The human was working on Windows. This single fact would be responsible for most of the subsequent debugging, not because Windows is poorly supported, but because the intersection of Windows + conda + PyTorch + CUDA + SSL environments creates a specific set of failure modes that are rarely documented together in one place.

A conda environment YAML file was produced with careful version pinning, including one decision that would prove important: installing ffmpeg via conda-forge rather than pip, so it would land on the PATH automatically without manual configuration.
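The exact file is not reproduced here, but an environment of the shape described, with ffmpeg coming from conda-forge and the pins the session eventually settled on (see the table in Part Three), would look roughly like this; the Python version and the unpinned entries are illustrative:

```yaml
name: audiobook
channels:
  - conda-forge
dependencies:
  - python=3.11          # illustrative; any version the pinned packages support
  - ffmpeg               # from conda-forge, so it lands on PATH automatically
  - pip
  - pip:
      - --extra-index-url https://download.pytorch.org/whl/cu124
      - torch            # cu124 build, matching the CUDA fix described in Part Three
      - coqui-tts==0.26.0
      - transformers==4.47.1
      - pydub
      - tqdm
```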



Part Three

The Debugging Marathon

Twelve issues, methodically resolved.

What followed was a sustained sequence of errors. Each one was diagnosed, explained, and resolved.

1 · docopt build failure
Root cause: Proxy blocking pip’s setuptools download during the source build.
Resolution: Pre-install via conda-forge to use the binary wheel.

2 · fbgemm.dll load error
Root cause: PyTorch CUDA 11.8 build vs driver CUDA 13.0 mismatch.
Resolution: Reinstall PyTorch with the cu124 build.

3 · isin_mps_friendly import error
Root cause: transformers 5.x removed a function coqui-tts depends on.
Resolution: Pin transformers to 4.47.1.

4 · is_torch_greater_or_equal error
Root cause: transformers 4.44 too old for coqui-tts 0.27.5.
Resolution: Pin to transformers 4.47.1.

5 · Dependency conflict warning
Root cause: coqui-tts 0.27.5 requires transformers>=4.57, which itself breaks.
Resolution: Downgrade to coqui-tts 0.26.0 plus transformers 4.47.1.

6 · encodec build failure
Root cause: The same SSL issue blocking setuptools for a source build.
Resolution: pip install with --no-build-isolation.

7 · Model download SSL failure
Root cause: SSL handshake reset on a large file download from HuggingFace.
Resolution: Manual browser download of the model weights.

8 · ValueError: checkpoint_dir
Root cause: The TTS API expects a folder path, not the model.pth file path.
Resolution: Pass the model_dir folder, not an individual file.

9 · OSError: invalid filename
Root cause: A chapter heading contained a tab character, an invalid path on Windows.
Resolution: Replace the whitespace regex to catch all \s.

10 · Front matter not read
Root cause: detect_chapters discarded pre-chapter content.
Resolution: Capture text before the first heading as a Front Matter section.

11 · Corrupt MP3 output
Root cause: pydub MP3 export failing silently.
Resolution: Default keep_chunks=True so WAVs are always preserved (sketched below).

12 · Accent not preserved
Root cause: XTTS v2 trained predominantly on Western English.
Resolution: Multiple reference WAVs; limitation flagged honestly.
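Two of those entries, the corrupt MP3 output in error 11 and the stitching described in item 3 of Part Two, come down to how per-chunk audio is assembled. A rough pydub sketch of that step, keeping the lossless WAV as the safety net (function and argument names here are illustrative, not the session’s actual script):

```python
from pathlib import Path

from pydub import AudioSegment

def stitch_chapter(chunk_paths: list[Path], out_path: Path,
                   gap_ms: int = 600, keep_chunks: bool = True) -> None:
    """Join per-chunk WAVs into one chapter, with configurable silence gaps."""
    silence = AudioSegment.silent(duration=gap_ms)
    chapter = AudioSegment.empty()
    for i, path in enumerate(chunk_paths):
        if i:
            chapter += silence
        chapter += AudioSegment.from_wav(str(path))
    # Write the WAV first: even if the MP3 export misbehaves, nothing is lost.
    chapter.export(str(out_path.with_suffix(".wav")), format="wav")
    try:
        chapter.export(str(out_path.with_suffix(".mp3")), format="mp3")
    except Exception as exc:
        print(f"MP3 export failed ({exc}); WAV kept at {out_path.with_suffix('.wav')}")
    if not keep_chunks:
        for path in chunk_paths:
            path.unlink(missing_ok=True)
```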

A Closer Look at Three Key Diagnostic Moments

The CUDA Version Discovery

When the fbgemm.dll error appeared, the immediate assumption was a missing Visual C++ redistributable, a common Windows PyTorch failure mode. The human ran the diagnostic command and came back with all VC++ runtimes present and up to date. The next step was nvidia-smi, which revealed CUDA 13.0, a version so new that no current PyTorch build targets it. The fix was to install the highest available PyTorch CUDA build (cu124) and rely on NVIDIA’s driver backwards compatibility.
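A quick way to confirm that kind of fix has landed, and the check the pipeline’s GPU auto-detection ultimately rests on, is to ask PyTorch directly. A generic verification snippet, not a command from the session transcript:

```python
import torch

# torch.version.cuda reports the CUDA toolkit PyTorch was built against (e.g. 12.4),
# not the driver's CUDA version shown by nvidia-smi; the driver only needs to be newer.
print("Built against CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

# The pipeline's graceful CPU fallback reduces to a single line like this:
device = "cuda" if torch.cuda.is_available() else "cpu"
```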

🔍
Diagnostic Pattern

Good debugging follows a hypothesis tree. Rule out the most common cause first, then move to less likely causes using targeted diagnostic commands rather than guessing.

The Transformers Version Sandwich

The transformers library conflict was particularly instructive. Version 4.44 was too old (missing is_torch_greater_or_equal). Version 4.46+ broke coqui-tts (removed isin_mps_friendly). The latest coqui-tts declared a requirement for 4.57+, which itself broke the import. The solution required finding the narrow working window, 4.47.1, and pinning both coqui-tts and transformers simultaneously so pip could resolve them as a unit.

Dependency conflicts in Python are rarely about a single package. They are about the intersection of multiple packages’ assumptions about a shared dependency’s API.

The Model Download That Wouldn’t Complete

The most frustrating failure was the model download. At 1.87GB, the XTTS v2 weights downloaded to 100% — twice — before the SSL connection was reset during the post-download verification step. The solution was bypassing Python’s HTTP stack entirely and downloading the six model files directly through a browser, then pointing the pipeline at the local directory via a new --model_dir CLI argument.

This also prompted a good design conversation: should the model directory be hardcoded, or should it be a CLI parameter? The human correctly identified that it should be a parameter — a small but meaningful software design decision that emerged naturally from the debugging context.
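What the fix looks like in code is worth showing, because the distinction between a file path and a folder path is exactly what error 8 turned on. A minimal sketch of loading XTTS v2 from a locally downloaded folder behind a --model_dir flag, assuming the coqui-tts 0.26 API; filenames and argument names are illustrative:

```python
import argparse
from pathlib import Path

import torch
from TTS.api import TTS  # coqui-tts

parser = argparse.ArgumentParser()
parser.add_argument("--model_dir", type=Path, required=True,
                    help="Folder holding the manually downloaded XTTS v2 files")
parser.add_argument("--speaker_wav", type=Path, required=True,
                    help="Reference recording of the author's voice")
args = parser.parse_args()

# Pass the *directory*, not model.pth itself; the config sits alongside the weights.
device = "cuda" if torch.cuda.is_available() else "cpu"
tts = TTS(model_path=str(args.model_dir),
          config_path=str(args.model_dir / "config.json")).to(device)

tts.tts_to_file(
    text="A short test sentence from the manuscript.",
    speaker_wav=str(args.speaker_wav),
    language="en",
    file_path="test_chunk.wav",
)
```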



Part Four

The Questions Beyond the Code

Some of the most valuable exchanges in the session were not about code at all.

“How many minutes should I record on my phone?”

A practical question with a nuanced answer. The model technically needs six seconds but produces noticeably better clones with 60–90 seconds of reference audio. Beyond three minutes there is no meaningful quality improvement. The specific advice: aim for 90 seconds, read varied sentence types from your own book, record in a soft room (a wardrobe full of clothes works well), use the phone’s rear microphone, and convert to WAV with a single ffmpeg command before running the pipeline.
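The single ffmpeg command mentioned above is short enough to quote in full; the filenames, mono downmix, and 22.05 kHz sample rate are illustrative choices rather than the session’s exact command:

```
ffmpeg -i phone_recording.m4a -ac 1 -ar 22050 reference_voice.wav
```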

“Can I sell the audiobook I generate with this?”

A genuine licensing question with a genuinely complicated answer. XTTS v2 is released under the Coqui Public Model License, which prohibits building competing TTS products but was not designed to prevent authors from selling content they create with the tool. The complicating factor: Coqui AI shut down in January 2024, so there is no longer an active licensor to contact.

⚖️
Licensing Reality Check

AI tools often exist in licensing grey areas, especially when the original company has shut down. Understanding the risk profile, not just whether something is technically permitted, is part of making good decisions.

“Is there any work on preserving Nigerian accents?”

Perhaps the most interesting question in the session. The human’s book, A Jar of Clay, is set in Nigeria. The AI-generated voice sounded American. This was not a bug; rather, it was a fundamental limitation of XTTS v2’s training data, which skews heavily towards Western English.

A web search surfaced several genuinely relevant findings:

Hypa_Orpheus-3b

A fine-tuned model trained on over 300 hours of Nigerian-accented audio covering Igbo, Yoruba, and Hausa speakers, with voice cloning support, available on HuggingFace.

A NeurIPS 2025 Research Project

Building a Nigerian English TTS system on the StyleTTS2 architecture using community-curated data from Yoruba, Igbo, and Hausa speakers.

YarnGPT

A Nigerian-focused TTS project built specifically for Nigerian languages and accents.

The honest recommendation: for a book where accent authenticity matters to the work itself, consider recording yourself and using AI only for audio cleanup. Your authentic Nigerian accent is an asset, not something to approximate.



Part Five

What This Session Illustrates

The Compounding Value of Context

The session’s productivity gains were not linear. Each exchange built on the last. By the end, the AI knew the human was working on Windows, had an RTX 3060 Ti with CUDA 13.0, was using Anaconda, was behind a network that blocked SSL on source package builds, was writing a novel set in Nigeria, and cared about accent authenticity. None of this was stated upfront; it emerged through the natural flow of problem and response.

A human expert consulting in the same role would have taken 20–30 minutes of scoping questions to gather the same context. The AI gathered it in the course of doing the work.

Honest Failure Modes

The session also illustrates where AI assistance has real limits:

  • The transformers version conflict required three iterations to resolve because the working window was narrow and underdocumented.
  • The accent limitation was a genuine surprise. XTTS v2’s Western bias is not prominently disclosed in its documentation.
  • One script bug was acknowledged directly: the model_path vs checkpoint_dir error was the AI’s mistake, not the user’s environment.

Productivity with AI is not about getting the right answer first time. It is about shortening the distance between the wrong answer and the right one.

The Skill the Human Brought

It would be a mistake to read this session as AI doing everything. The human brought:

Domain knowledge
— knowing they wanted voice cloning, having a manuscript ready, knowing their hardware specs.
Judgment
— correctly identifying that model_dir should be a CLI parameter, not a hardcode.
Patience and precision
— copying exact error messages in full, which made diagnosis possible.
Creative vision
— understanding why Nigerian accent authenticity mattered to their specific work.

The AI provided breadth, pattern recognition across a large technical knowledge base, and sustained focus across a long debugging session. The human provided judgment, context, and the questions that mattered.

Conservative Estimate
3–5 days

of independent research and debugging, compressed into a single session, not because the AI knew everything, but because it could move from hypothesis to fix faster than any search engine.



The Session Is the Product

The output of this session was not just a Python script. It was a working mental model, built collaboratively, of how voice cloning works locally, what its limitations are, where the licensing boundaries sit, and what the frontier looks like for underrepresented accents in AI speech synthesis.

That kind of contextual understanding is hard to acquire from documentation alone. It usually comes from the kind of dialogue that used to require access to an expert willing to sit with you through the messy middle of a technical problem.

What AI assistance offers, at its best, is not expertise on demand. It is a patient, context-aware thinking partner who can hold the whole problem in view while you work through it piece by piece.

The most productive use of AI is not replacing the human in the loop. It is making the loop faster, more informed, and less lonely.

A Jar of Clay is being written by a human, about human experience, in a voice shaped by a specific place and culture. The technology serving that project should aspire to the same fidelity. We are not there yet. But we are, as this session shows, closer than we were.

A Jar of Clay · Tade Oyebode

Session conducted April 2026.
