Most AI productivity articles show you curated demos: a clean prompt, a perfect response, a tidy outcome. This is not that article.
What follows is a forensic account of a real working session between a human, a writer wanting to turn a manuscript into an audiobook using their own voice, and Claude, Anthropic’s AI assistant. The session ran long and hit real problems: SSL certificate failures, dependency conflicts, and a fundamental question about Nigerian accent preservation in AI voice synthesis.
The measure of AI productivity is not how fast it answers easy questions. It is how useful it remains when things go wrong.
By the end of the session, the human had a working Python pipeline generating audio from their book manuscript using a cloned version of their own voice, running locally on their Windows machine with GPU acceleration, no cloud subscription, no monthly fee, no vendor lock-in.
This article breaks down how that happened, what went wrong, how problems were diagnosed and resolved, and what this style of human-AI collaboration means for knowledge workers tackling technical problems outside their core expertise.
Mapping the Solution Space
The session opened with a simple, open-ended question:
“I want to create an audio book of a book I wrote using my own voice. How can I use AI for this?”
This is exactly the kind of question that benefits from AI assistance: it spans multiple technical domains, has no single correct answer, and requires navigating tradeoffs the questioner may not even know exist yet.
The Response Strategy: Options, Not Prescriptions
Rather than immediately recommending a single tool, the AI mapped the solution space into three distinct approaches:
Voice Cloning with AI TTS
Record a short voice sample, upload to a service like ElevenLabs, generate the full audiobook in your cloned voice.
Record Yourself, Use AI for Cleanup
Your real voice, cleaned up with tools like Adobe Podcast or Descript.
The Hybrid
Record key passages yourself, clone the rest.
This structure was deliberate. Rather than assuming the human wanted to avoid recording at all, or that they had a microphone, or that they were comfortable with cloud services, the AI presented distinct strategies that traded off differently on cost, effort, quality, and privacy.
Starting with a structured options map rather than jumping to a solution saves significant rework time. The human can self-select based on constraints the AI does not yet know about.
The Follow-Up That Changed Everything
The human’s follow-up revealed a crucial constraint that reshaped the entire direction:
“For option 1, is there a way I can do this on my own desk without using those services? I can code.”
Two critical pieces of information in one sentence: a preference for local execution (privacy, cost, control) and a technical capability (coding). This unlocked a completely different solution tier: open-source, GPU-accelerated, fully local voice cloning, which would have been inappropriate to recommend without knowing the human could implement it.
The best AI interactions are dialogues, not monologues. Each exchange narrows the solution space until a specific, actionable path emerges.
From Concept to Working Code
Once the human confirmed they had a GPU available, the session moved from exploration to execution. Claude produced two deliverables in a single response: a complete Python pipeline (audiobook_pipeline.py) handling the full workflow from text file to finished audiobook, and an interactive browser-based configuration dashboard (audiobook_ui.html) for building the command without memorising flags.
The pipeline included several non-obvious engineering decisions worth noting:
The 250-character chunking limit is a hard constraint of XTTS v2 that catches many first-time users. Building it into the pipeline from the start prevents hours of debugging later.
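The chunking step can be sketched roughly like this (the function name and its sentence-splitting heuristic are illustrative, not the pipeline's actual code):

```python
import re

MAX_CHARS = 250  # hard per-input limit of XTTS v2

def chunk_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most max_chars characters,
    breaking at sentence boundaries so the TTS model receives
    natural-sounding phrases rather than truncated ones."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-split on spaces.
        while len(sentence) > max_chars:
            cut = sentence.rfind(" ", 0, max_chars)
            cut = cut if cut > 0 else max_chars
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:cut])
            sentence = sentence[cut:].lstrip()
        if not current:
            current = sentence
        elif len(current) + 1 + len(sentence) <= max_chars:
            current += " " + sentence
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Breaking at sentence boundaries matters beyond the hard limit: the model produces more natural prosody when each chunk is a complete phrase.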
The Windows Problem Nobody Warns You About
The human was working on Windows. This single fact would be responsible for most of the subsequent debugging, not because Windows is poorly supported, but because the intersection of Windows + conda + PyTorch + CUDA + SSL environments creates a specific set of failure modes that are rarely documented together in one place.
A conda environment YAML file was produced with careful version pinning, including one decision that would prove important: installing ffmpeg via conda-forge rather than pip, so it would land on the PATH automatically without manual configuration.
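Such an environment file might look something like this. It is a sketch: only the conda-forge ffmpeg choice and the transformers pin come from the session; the Python version and remaining entries are assumptions.

```yaml
name: audiobook
channels:
  - conda-forge
dependencies:
  - python=3.11        # assumed version, not confirmed in the session
  - ffmpeg             # via conda-forge, so it lands on PATH automatically
  - pip
  - pip:
      - coqui-tts
      - transformers==4.47.1
```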
The Debugging Marathon
Twelve issues, methodically resolved.
What followed was a sustained sequence of errors. Each one was diagnosed, explained, and resolved.
A Closer Look at Three Key Diagnostic Moments
The CUDA Version Discovery
When the fbgemm.dll error appeared, the immediate assumption was a missing Visual C++ redistributable, a common Windows PyTorch failure mode. The human ran the diagnostic command and came back with all VC++ runtimes present and up to date. The next step was nvidia-smi, which revealed CUDA 13.0, a version so new that no current PyTorch build targets it. The fix was to install the highest available PyTorch CUDA build (cu124) and rely on NVIDIA’s driver backwards compatibility.
Good debugging follows a hypothesis tree. Rule out the most common cause first, then move to less likely causes using targeted diagnostic commands rather than guessing.
The Transformers Version Sandwich
The transformers library conflict was particularly instructive. Version 4.44 was too old (missing is_torch_greater_or_equal). Version 4.46+ broke coqui-tts (removed isin_mps_friendly). The latest coqui-tts declared a requirement for 4.57+, which itself broke the import. The solution required finding the narrow working window (4.47.1) and pinning both coqui-tts and transformers simultaneously so pip could resolve them as a unit.
Dependency conflicts in Python are rarely about a single package. They are about the intersection of multiple packages’ assumptions about a shared dependency’s API.
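The resolution logic amounts to interval arithmetic over version numbers: the working version must sit inside every package's acceptable window at once. A toy version of the check (the bounds below are illustrative, and real resolvers use packaging.version rather than this simplified tuple comparison):

```python
def parse_version(v: str) -> tuple[int, ...]:
    """Turn '4.47.1' into (4, 47, 1) for numeric comparison."""
    return tuple(int(part) for part in v.split("."))

def satisfies_all(version: str, windows: list[tuple[str, str]]) -> bool:
    """True if `version` falls inside every [low, high) window, i.e. it
    satisfies the intersection of all packages' constraints at once."""
    v = parse_version(version)
    return all(parse_version(lo) <= v < parse_version(hi)
               for lo, hi in windows)

# Hypothetical constraint set shaped like the session's conflict:
windows = [("4.45", "5.0"),    # needs is_torch_greater_or_equal
           ("4.40", "4.48")]   # breaks beyond this point
```

Here satisfies_all("4.47.1", windows) is true while satisfies_all("4.44.0", windows) is false, which is why the fix was a simultaneous pin rather than upgrading one package at a time.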
The Model Download That Wouldn’t Complete
The most frustrating failure was the model download. At 1.87GB, the XTTS v2 weights downloaded to 100% — twice — before the SSL connection was reset during the post-download verification step. The solution was bypassing Python’s HTTP stack entirely and downloading the six model files directly through a browser, then pointing the pipeline at the local directory via a new --model_dir CLI argument.
This also prompted a good design conversation: should the model directory be hardcoded, or should it be a CLI parameter? The human correctly identified that it should be a parameter — a small but meaningful software design decision that emerged naturally from the debugging context.
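A sketch of how such a parameter might be wired up with argparse. The flag name --model_dir comes from the session; everything else here, including the hub model id, is illustrative.

```python
import argparse
from pathlib import Path

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="audiobook_pipeline.py")
    parser.add_argument("manuscript", help="path to the book's text file")
    parser.add_argument("--model_dir", type=Path, default=None,
                        help="directory with pre-downloaded XTTS v2 files; "
                             "skips the network download entirely")
    return parser

def model_source(args: argparse.Namespace) -> str:
    """Prefer locally downloaded weights; otherwise fetch from the hub."""
    if args.model_dir is not None:
        return f"local:{args.model_dir}"
    return "hub:tts_models/multilingual/multi-dataset/xtts_v2"
```

Keeping the path out of the source is what makes the script portable: the same file runs on another machine, or after the model directory moves, with nothing edited.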
The Questions Beyond the Code
Some of the most valuable exchanges in the session were not about code at all.
“How many minutes should I record on my phone?”
A practical question with a nuanced answer. The model technically needs six seconds but produces noticeably better clones with 60–90 seconds of reference audio. Beyond three minutes there is no meaningful quality improvement. The specific advice: aim for 90 seconds, read varied sentence types from your own book, record in a soft room (a wardrobe full of clothes works well), use the phone’s rear microphone, and convert to WAV with a single ffmpeg command before running the pipeline.
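The conversion step can be captured as a small helper that builds the ffmpeg command. The 22050 Hz mono target is a common choice for XTTS reference audio, assumed here rather than taken from the session.

```python
from pathlib import Path

def to_reference_wav_cmd(source: Path, dest: Path) -> list[str]:
    """Build the ffmpeg argv that converts a phone recording (m4a, mp3,
    etc.) into the mono WAV the pipeline uses as reference audio."""
    return [
        "ffmpeg",
        "-i", str(source),   # input: whatever format the phone produced
        "-ac", "1",          # downmix to a single (mono) channel
        "-ar", "22050",      # resample to 22.05 kHz
        str(dest),
    ]
```

Run it with, for example, subprocess.run(to_reference_wav_cmd(Path("voice_memo.m4a"), Path("my_voice.wav")), check=True).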
“Can I sell the audiobook I generate with this?”
A genuine licensing question with a genuinely complicated answer. XTTS v2 is released under the Coqui Public Model License, which prohibits building competing TTS products but was not designed to prevent authors from selling content they create with the tool. The complicating factor: Coqui AI shut down in January 2024, so there is no longer an active licensor to contact.
AI tools often exist in licensing grey areas, especially when the original company has shut down. Understanding the risk profile, not just whether something is technically permitted, is part of making good decisions.
“Is there any work on preserving Nigerian accents?”
Perhaps the most interesting question in the session. The human’s book, A Jar of Clay, is set in Nigeria. The AI-generated voice sounded American. This was not a bug; rather, it was a fundamental limitation of XTTS v2’s training data, which skews heavily towards Western English.
A web search surfaced several genuinely relevant findings:
A fine-tuned model trained on over 300 hours of Nigerian-accented audio covering Igbo, Yoruba, and Hausa speakers, with voice cloning support, available on HuggingFace.
Building a Nigerian English TTS system on the StyleTTS2 architecture using community-curated data from Yoruba, Igbo, and Hausa speakers.
A Nigerian-focused TTS project built specifically for Nigerian languages and accents.
The honest recommendation: for a book where accent authenticity matters to the work itself, consider recording yourself and using AI only for audio cleanup. Your authentic Nigerian accent is an asset, not something to approximate.
What This Session Illustrates
The Compounding Value of Context
The session’s productivity gains were not linear. Each exchange built on the last. By the end, the AI knew the human was working on Windows, had an RTX 3060 Ti with CUDA 13.0, was using Anaconda, was behind a network that blocked SSL on source package builds, was writing a novel set in Nigeria, and cared about accent authenticity. None of this was stated upfront; it emerged through the natural flow of problem and response.
A human expert consulting in the same role would have taken 20–30 minutes of scoping questions to gather the same context. The AI gathered it in the course of doing the work.
Honest Failure Modes
The session also illustrates where AI assistance has real limits:
- The transformers version conflict required three iterations to resolve because the working window was narrow and underdocumented.
- The accent limitation was a genuine surprise. XTTS v2’s Western bias is not prominently disclosed in its documentation.
- One script bug was acknowledged directly: the model_path vs checkpoint_dir error was the AI’s mistake, not the user’s environment.
Productivity with AI is not about getting the right answer first time. It is about shortening the distance between the wrong answer and the right one.
The Skill the Human Brought
It would be a mistake to read this session as AI doing everything. The human brought:
— knowing they wanted voice cloning, having a manuscript ready, knowing their hardware specs.
— correctly identifying that model_dir should be a CLI parameter, not a hardcoded value.
— copying exact error messages in full, which made diagnosis possible.
— understanding why Nigerian accent authenticity mattered to their specific work.
The AI provided breadth, pattern recognition across a large technical knowledge base, and sustained focus across a long debugging session. The human provided judgment, context, and the questions that mattered.
A substantial amount of independent research and debugging was compressed into a single session, not because the AI knew everything, but because it could move from hypothesis to fix faster than any search engine.
The Session Is the Product
The output of this session was not just a Python script. It was a working mental model, built collaboratively, of how voice cloning works locally, what its limitations are, where the licensing boundaries sit, and what the frontier looks like for underrepresented accents in AI speech synthesis.
That kind of contextual understanding is hard to acquire from documentation alone. It usually comes from the kind of dialogue that used to require access to an expert willing to sit with you through the messy middle of a technical problem.
What AI assistance offers, at its best, is not expertise on demand. It is a patient, context-aware thinking partner who can hold the whole problem in view while you work through it piece by piece.
The most productive use of AI is not replacing the human in the loop. It is making the loop faster, more informed, and less lonely.
A Jar of Clay is being written by a human, about human experience, in a voice shaped by a specific place and culture. The technology serving that project should aspire to the same fidelity. We are not there yet. But we are, as this session shows, closer than we were.
A Jar of Clay · Tade Oyebode
Session conducted April 2026.