Subnet 78: Vocence

Vocence is Bittensor Subnet 78, a subnet for voice intelligence. The Vocence README source describes it as a network for evaluating models that follow natural-language voice instructions. The current focus is prompt-based text-to-speech: generating spoken audio from a written description of both what to say and how it should sound.

What Vocence Rewards

A request to Vocence is a prompt that specifies both the content and the voice. For example, a prompt might ask for a calm, middle-aged male voice with a warm tone, speaking slowly, reading a given sentence. The subnet rewards miners whose models produce audio that best satisfies that request, judged on three things: whether the speech matches the text, whether the audio is clear and natural, and how closely the voice matches the requested traits such as tone, emotion, pitch, speed, age, and accent. Because every model is measured the same way on the same prompts, the comparison stays consistent across miners.

Evaluation Context

Vocence’s evaluation problem is more specific than ordinary text-to-speech quality. A good answer must preserve the requested words, sound natural, and match the requested voice traits. The scoring documentation describes tasks derived from source audio, where validators turn the source into a voice prompt and then compare generated miner audio back against the expected content and traits.

This matters because a miner can fail the task in several different ways. It might speak the right words with the wrong emotion, match the voice style while dropping content, or produce audio that is technically clear but sounds less natural than the reference. Treating those dimensions separately lets validators reward models that are useful for voice generation rather than models that optimize only one visible property.

Task Specification Context

The scoring documentation describes each evaluation task as coming from real source audio. A validator extracts a structured specification from that source, including the transcript and voice traits such as gender, pitch, speed, age group, emotion, tone, and accent. The same source also produces a natural-language instruction that describes the voice in a user-like prompt.

Those two views serve different purposes. Miners receive the text and natural-language instruction, not the structured scoring table. The structured fields are retained for evaluation, so validators can compare the generated audio back against the source-derived traits. This lets Vocence test whether a model can follow natural prompts while still giving validators a repeatable scoring target.
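The split between the two views can be sketched in a few lines of Python. This is a minimal illustration, not the subnet's actual schema: the class names VoiceTraits and TaskSpec, the function miner_request, and the example values are all hypothetical. Only the list of trait fields comes from the scoring documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceTraits:
    """Structured traits a validator extracts from source audio (field names assumed)."""
    gender: str
    pitch: str
    speed: str
    age_group: str
    emotion: str
    tone: str
    accent: str


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical container for one source-derived evaluation task."""
    transcript: str
    traits: VoiceTraits
    instruction: str  # natural-language description of the voice


def miner_request(spec: TaskSpec) -> dict:
    """Build the miner-visible view: text and instruction only, no structured traits."""
    return {"text": spec.transcript, "instruction": spec.instruction}


# Illustrative task; the values are invented for this example.
spec = TaskSpec(
    transcript="The train leaves at nine.",
    traits=VoiceTraits(
        gender="male", pitch="low", speed="slow", age_group="middle-aged",
        emotion="calm", tone="warm", accent="neutral",
    ),
    instruction="A calm middle-aged male voice with a warm tone, speaking slowly.",
)
request = miner_request(spec)
print(request)
```

The validator keeps spec.traits for scoring, while the miner only ever sees the two fields in request.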

The scoring docs also separate two judging steps. First, the miner’s generated audio is analyzed for the expected content and voice traits. Second, the miner audio is compared with the source audio for naturalness. The final evaluation combines those element scores, and the binary win flag used for ranking comes from whether the generated audio clears the pass threshold.
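One way to picture how element scores become a win flag is a weighted sum compared against a threshold. The sketch below is an assumption-heavy illustration: the element names, the weights in ELEMENT_WEIGHTS, the value of PASS_THRESHOLD, and the use of a weighted sum are all hypothetical, since the article does not state the actual combination rule or threshold.

```python
# Hypothetical weights and threshold, chosen only for illustration.
ELEMENT_WEIGHTS = {"content": 0.4, "traits": 0.3, "naturalness": 0.3}
PASS_THRESHOLD = 0.7


def combine(element_scores: dict[str, float]) -> tuple[float, bool]:
    """Combine per-element scores in [0, 1] into a continuous score and a win flag."""
    score = sum(
        weight * element_scores[name] for name, weight in ELEMENT_WEIGHTS.items()
    )
    # The binary flag depends only on whether the score clears the threshold.
    return score, score >= PASS_THRESHOLD


# Right words, good trait match, somewhat less natural than the source audio.
passing_score, passing_flag = combine(
    {"content": 0.9, "traits": 0.8, "naturalness": 0.6}
)
# Right words, but the requested voice traits are mostly missed.
failing_score, failing_flag = combine(
    {"content": 0.9, "traits": 0.3, "naturalness": 0.5}
)
print(round(passing_score, 2), passing_flag)
print(round(failing_score, 2), failing_flag)
```

Under these assumed weights, the second clip fails despite accurate content, which is the kind of partial failure the separate elements are meant to expose.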

The same source distinguishes diagnostic detail from the ranking primitive. Validators store the continuous score, per-element breakdown, extracted traits, and naturalness result, but winner selection uses the generated-wins flag derived from the pass threshold. This lets the subnet keep audit detail without turning every diagnostic number into a separate reward rule. It also keeps the source-derived task specification useful for review after the audio has been generated.
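The difference between stored detail and the ranking primitive can be shown with a small record type. This is a sketch under stated assumptions: the EvaluationRecord fields and the use of a per-miner win rate as the ranking statistic are hypothetical, chosen to show that selection reads only the win flag while the continuous score stays available for audit.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class EvaluationRecord:
    """Hypothetical stored result for one miner on one task."""
    miner: str
    score: float            # continuous score, kept for diagnostics
    element_scores: dict    # per-element breakdown, kept for diagnostics
    generated_wins: bool    # binary flag derived from the pass threshold


def win_rates(records: list[EvaluationRecord]) -> dict[str, float]:
    """Rank input: fraction of tasks each miner won, ignoring continuous scores."""
    wins: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for record in records:
        totals[record.miner] += 1
        wins[record.miner] += int(record.generated_wins)
    return {miner: wins[miner] / totals[miner] for miner in totals}


records = [
    EvaluationRecord("miner_a", 0.99, {"content": 1.0}, True),
    EvaluationRecord("miner_a", 0.69, {"content": 0.6}, False),
    EvaluationRecord("miner_b", 0.71, {"content": 0.8}, True),
    EvaluationRecord("miner_b", 0.72, {"content": 0.8}, True),
]
rates = win_rates(records)
winner = max(sorted(rates), key=lambda miner: rates[miner])
print(rates, winner)
```

In this invented data, miner_a has the higher average continuous score, but miner_b is selected because it clears the threshold on every task.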

This distinction keeps prompt adherence from being treated as a single vague quality score. A model can be measured on whether it spoke the right words, whether the requested traits were preserved, and whether the result sounded natural. The article’s three scoring dimensions map to those source-described task and evaluation stages.

References: Vocence scoring source, Vocence README source

Scoring Context

The scoring docs also describe cross-validator aggregation for weight setting. Validators generate their own evaluation samples, but weight decisions are intended to use recent evidence from active validators rather than only one validator’s local sample window. On a winner-takes-all subnet, small differences between local samples can change which miner a validator ranks first, so pooling evidence reduces drift between honest validators.
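The effect of pooling can be illustrated with a toy example. The sketch below assumes that each validator's samples are already limited to a recent window and are represented as (miner, win flag) pairs. The function names, the data layout, and the win-rate statistic are hypothetical and are not taken from the Vocence code.

```python
from collections import defaultdict


def pooled_win_rates(
    samples_by_validator: dict[str, list[tuple[str, bool]]],
    active_validators: set[str],
) -> dict[str, float]:
    """Pool recent (miner, win) samples from active validators into win rates."""
    wins: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for validator, samples in samples_by_validator.items():
        if validator not in active_validators:
            continue  # evidence from inactive validators is ignored
        for miner, won in samples:
            totals[miner] += 1
            wins[miner] += int(won)
    return {miner: wins[miner] / totals[miner] for miner in totals}


def winner_takes_all(rates: dict[str, float]) -> dict[str, float]:
    """Give all weight to the top miner; ties break by miner name for determinism."""
    top = max(sorted(rates), key=lambda miner: rates[miner])
    return {miner: 1.0 if miner == top else 0.0 for miner in rates}


samples = {
    "validator_1": [("miner_1", False), ("miner_2", True)],
    "validator_2": [("miner_1", True), ("miner_1", True), ("miner_2", False)],
    "validator_3": [("miner_2", True)] * 5,  # not active, so excluded below
}
active = {"validator_1", "validator_2"}

local_weights = winner_takes_all(
    pooled_win_rates({"validator_1": samples["validator_1"]}, active)
)
pooled_weights = winner_takes_all(pooled_win_rates(samples, active))
print(local_weights, pooled_weights)
```

Here validator_1's local window alone would favor miner_2, while the pooled evidence from both active validators favors miner_1, so both validators would submit the same weights.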

For readers, the important point is that Vocence rewards consistent voice-model performance across shared dimensions: content accuracy, audio naturalness, and prompt adherence. The subnet’s output is therefore not just an audio clip, but an evaluated model capability that can be compared across miners and across validator samples.

Miner and Validator Roles

Miners build and run the voice models. A miner trains a prompt-to-speech model and deploys it so it can be queried, then competes to generate the highest-quality audio for the evaluation prompts it receives.

Validators measure that quality. They send a shared set of evaluation prompts to each miner’s model, score the returned audio on content accuracy, audio quality, and adherence to the requested voice, and set weights on the network so that the best models earn the most. Those weights are the input that Yuma Consensus uses to distribute rewards.

Source and Live Data

The codebase is maintained in the vocence-78/vocence repository. Live SN78 data is available on TaoStats. The mechanism details in this article are tied to the public README and scoring documentation rather than to live identity fields.

Relationship to Yuma Consensus

Subnet 78 uses Yuma Consensus to convert the voice-quality weight vectors that validators submit into the emission shares distributed to miners and validators within the subnet each tempo. The Yuma Consensus documentation describes how validator weight submissions are aggregated into consensus weights for each miner registered on the subnet.

In Vocence’s context, those weight vectors come from the evaluation process: validators send a shared set of evaluation prompts to each miner’s model, score the returned audio on content accuracy, audio naturalness, and adherence to the requested voice traits, and aggregate scores across recent validator evidence before submitting weights. Because ranking is winner-takes-all, it rewards consistently high performance on all three scoring dimensions. The Emission documentation describes how the resulting consensus weights determine each participant’s share of the subnet’s accumulated emission each tempo.
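The aggregation of validator weight submissions can be pictured with a simplified model. The sketch below is not the on-chain implementation. It assumes a stake-weighted consensus in which a miner's consensus weight is the largest weight supported by validators holding at least a fraction kappa of total stake, with each validator's weight clipped to that level before stake-weighted summation. The function names, the kappa value of 0.5, and the example stakes are illustrative assumptions, and the sketch omits validator rewards and any smoothing over time.

```python
def consensus_weight(weights: list[float], stakes: list[float], kappa: float = 0.5) -> float:
    """Largest weight backed by validators holding at least kappa of total stake."""
    total = sum(stakes)
    cumulative = 0.0
    # Walk from the highest submitted weight downward, accumulating stake.
    for weight, stake in sorted(zip(weights, stakes), reverse=True):
        cumulative += stake
        if cumulative >= kappa * total:
            return weight
    return 0.0


def incentive_shares(
    weight_matrix: list[list[float]], stakes: list[float], kappa: float = 0.5
) -> list[float]:
    """Clip each validator's weight to consensus, then normalize stake-weighted sums."""
    n_miners = len(weight_matrix[0])
    ranks = []
    for j in range(n_miners):
        column = [row[j] for row in weight_matrix]
        cap = consensus_weight(column, stakes, kappa)
        ranks.append(sum(s * min(w, cap) for w, s in zip(column, stakes)))
    total = sum(ranks)
    return [rank / total if total else 0.0 for rank in ranks]


# Three validators (rows) submit winner-takes-all weights over two miners (columns).
stakes = [50, 30, 20]
weight_matrix = [
    [1.0, 0.0],  # validator with stake 50 picks miner 0
    [1.0, 0.0],  # validator with stake 30 picks miner 0
    [0.0, 1.0],  # validator with stake 20 picks miner 1
]
shares = incentive_shares(weight_matrix, stakes)
print(shares)
```

In this toy case the minority validator's vote for miner 1 is clipped to zero because it lacks majority stake support, which illustrates why agreement among validators matters on a winner-takes-all subnet.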

Reader Boundary

Subnet 78 Vocence should not be read as generic Bittensor subnet documentation, a general voice cloning service, or a guarantee that any one audio clip implies stable long-term model quality. It describes one subnet’s prompt-to-speech evaluation market, where validators score miner models on content accuracy, audio naturalness, and adherence to requested voice traits under shared evaluation prompts (Vocence README source, Vocence scoring source).

The scoring docs keep miner-visible prompts separate from validator scoring tables. Miners receive a natural-language voice instruction and text to speak, while validators retain structured trait fields extracted from source audio to measure adherence and naturalness. That boundary ties rewards to following user-like prompts rather than to optimizing for hidden scoring fields alone.

Validator weights still flow through Yuma Consensus to determine emissions each tempo (Yuma Consensus, Emission).

Development Stage Context

The Introduction to Bittensor describes subnet development as moving from localnet to testnet and then mainnet. Subnet 78 follows the same standard Bittensor lifecycle: localnet for isolated development, testnet for shared non-production testing, and mainnet for live operation with real emissions.

On mainnet, Subnet 78 is registered as the live production subnet at netuid 78. The Bittensor Networks reference separates mainnet, testnet, and localnet. Participation examples or emission outcomes from one environment should not be read as representing performance in another, such as treating testnet results as production performance on mainnet.