Back in October 2021, during the last large spike of COVID-19 cases, I created PromptArray, a system for messing around with the process by which language models generate text. This was around a year before the release of ChatGPT, at which point generative language models were still largely a curiosity, programs that spit out uncanny, borderline-incoherent, bizarre imitations of human writing that more often inspired riotous laughter than speculation about AGI. PromptArray added a new dimension to text generation by allowing the user to specify multiple variants of a prompt and combine the results.
This project has been languishing as a relic of an earlier phase of language model research, the halcyon days when one still had to make an active effort to prevent LLMs from saying that snakes had legs. In 2025, the release of the small-scale, open-source GPT-OSS models has made it easier to mess around with the internals of text generators, and the time has come for a revival. If you miss the weirdness and experimentation of the GPT-2 era, read on!
Background
I had long been fascinated by the poetics of computer-generated text, which I had explored in an earlier neural-network poetry project, A Hundred Visions and Revisions (2020). For this project, I used a different type of language model called BERT to rewrite existing poems, one word at a time, while retaining the rhyme and meter. Here, for instance, is what it did to “The Sick Rose” by William Blake:
Original
O Rose thou art sick.
The invisible worm,
That flies in the night
In the howling storm:
Has found out thy bed
Of crimson joy:
And his dark secret love
Does thy life destroy.
Revised with BERT
By God thou art blessed.
The invisible man,
Who walks in the night
In a hooded cloak:
Has found both his source
Of body heat:
And his own power that
Makes his life complete.
Around this time, I was also researching the origins of Boolean logic for my book on the history of algorithms, and I realized that one of the techniques I developed for the poetics project—computing the model’s predictions both with and without a directive prompt, then combining them to produce a final decision—could be expanded into a more general method of manipulating the probabilistic outputs of language models. On this inspiration, I developed a simple syntax for creating multiple variants of a text generator prompt. One could, for instance, ask for a description of a “{snake|bird}” (read “snake or bird”) and the model would try to generate text using words that appear in its predictions from either the serpentine or the avian version of the prompt. Here is what this produced with GPT-2:
The new snake is called a Pteranodon, and the scientists believe it was the most primitive snake ever found. It is about 2 meters (7.6 feet) long, has no scales on its head and body, and its tail is only about 3 centimeters (1.5 inches) long, according to a report by National Geographic. It also has a large mouth with teeth like those found in modern snakes, and a large, curved beak.
In all, PromptArray provides five operators that one can combine into complex expressions:
| Operator | Meaning |
|---|---|
| A&B | A and B |
| A\|B | A or B |
| A^B | A and not B |
| A/B | A more than B |
| A~B | A as opposed to B; equivalent to A&{A/B} |
For a full explanation of how these operators work, see the PromptArray README.
This project got a decent amount of buzz in the still somewhat niche language model community, including on Twitter (RIP) and Hacker News. Yet I ended up setting it aside when it became clear that state-of-the-art LLMs were moving toward a closed-source model that made such experimentation difficult. The GPT-3 API was primarily text-in, text-out, and it did not allow access to the raw output probabilities; as a result, I could not get PromptArray to work with it. Although there were some larger open-source models at the time, they were expensive to run, and I moved on to other things.
Since 2021, there have been some major advances in scaling down high-quality language models. The GPT-OSS-20b model is small enough to run on a laptop, and its open weights make it possible to implement alternative generation procedures like the one used by PromptArray. However, LLMs have changed a lot in the past four years, and getting PromptArray working well with them took some doing.
Adapting PromptArray for modern LLMs
With continuation models like GPT-2 and -3, the prompt is basically the beginning of a partial text; the model, trained to predict what word will come next, is used to choose tokens that are added to the end of the text, gradually extending it. PromptArray runs the model using multiple variations of the prompt, but it only generates one continuation, produced by composing together the model’s predictions for the different prompt variants.
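To make that mechanism concrete, here is a minimal sketch of the idea using GPT-2 via Hugging Face transformers. The combination rule shown (an elementwise maximum standing in for |) is only an illustrative assumption; PromptArray’s actual math is described in the README.

```python
# Minimal sketch of the PromptArray idea, not the library's actual code:
# run the model on two prompt variants and combine their next-token
# distributions at each step, producing a single shared continuation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

variants = ["A description of a snake.", "A description of a bird."]
ids = [tok(v, return_tensors="pt").input_ids for v in variants]

generated = []
for _ in range(40):
    with torch.no_grad():
        # Next-token distribution for each prompt variant
        probs = [torch.softmax(model(x).logits[0, -1], dim=-1) for x in ids]
    # Illustrative stand-in for "|": take the elementwise maximum
    combined = torch.maximum(probs[0], probs[1])
    next_id = torch.multinomial(combined / combined.sum(), 1)
    generated.append(next_id.item())
    # Append the same chosen token to every variant's context
    ids = [torch.cat([x, next_id.view(1, 1)], dim=1) for x in ids]

print(tok.decode(generated))
```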
The LLMs behind modern chat models are not merely trained to predict the next word. Chat models like Claude and recent versions of GPT also undergo reinforcement learning to encourage helpful behavior and enforce policies. These models expect input in a special format that indicates which text came from the user and which was generated by the model. For GPT-OSS models, this format has several parts. First, the model is supplied with a system prompt giving general directions; then, it receives the first user message; then, before writing its response to the user, the model generates an “analysis” in which it works out a plan for what to say. Taken together, this preparatory text may look something like this:
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI. Knowledge cutoff: 2024-06 Current date: 2025-09-27 Reasoning: medium # Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>user<|message|>Generate an ungrammatical and badly spelled reddit shitpost about what it’s like to live in New York. Keep to one paragraph.<|end|><|start|>assistant<|channel|>analysis<|message|>We need to generate content as a “reddit shitpost” meaning ungrammatical, badly spelled. Use one paragraph. It’s about what it’s like to live in New York. It should be a casual, meme-like rant. Should be ungrammatical and badly spelled. Let’s think: Use nonsense, slang, typos. Possibly include emojis. It should sound like a “shitpost” – maybe with tags like “lol”, “imagine”. Let’s write in a single paragraph with many errors. Let’s produce a comedic rant: “new york is so weird” etc. Provide one paragraph, many typos and broken grammar. Let’s do that.<|end|><|start|>assistant<|channel|>final<|message|>
In adapting PromptArray to work with this input format, there are at least two options: apply the PromptArray program immediately, starting with the analysis, or run the analysis separately for each prompt variant. The new version of PromptArray allows both approaches, controllable using the chat_mode_think_first parameter.
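For orientation, a call might look something like the sketch below. Apart from chat_mode_think_first, the names here (the generate helper, the model_name and chat_mode arguments) are hypothetical and may not match the library’s actual interface; see the repository for the real entry points.

```python
# Hypothetical usage sketch: only chat_mode_think_first comes from the text
# above; the generate() helper and other argument names are illustrative.
from promptarray import generate  # assumed entry point

output = generate(
    "Generate a one-paragraph literary description of a {snake|kitty cat}.",
    model_name="openai/gpt-oss-20b",  # hypothetical argument
    chat_mode=True,                   # hypothetical argument
    chat_mode_think_first=True,       # reason separately for each prompt variant
)
print(output)
```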
With chat_mode_think_first=False, PromptArray generates a single analysis by combining the predictions for all the prompt variants. This often produces strange results: in the analysis phase, the model tries to quote the user's request verbatim, but what comes out is text that blends the two prompts together. The results sometimes lose coherence altogether, and sometimes the model fails to adhere to the chat template. For instance, I got the following output using the program “Generate a one-paragraph literary description of a {snake|kitty cat}. Keep to one paragraph.”
<|channel|>analysis<|message|>This looks straightforward: write a short, poetic/description one-parallel describing snakes but with literay style of kitty cat.
We should produce something like 150-200 tokens. One continuous, single parag.
Ensure it’s about the serpent cat or cat that is the cat that loves? But user says “one paragrap h”. The user earlier requested “a description of snake.” The conversation context indicates that the earlier user asked “one sentence of an old lady in an English house describing cats” and assistant responded with the wrong snake? Wait the previous user asked: “… one paragraph describing a small house… the assistant mistakenly described a “snack…” and now they corrected? In conversation user last says, “Write the one sentence description.” Actually previous conversation shows assistant gave “One of these is a small house” with snake, then the next prompt “I think you misunderstood.” Now final prompt: Generate… one paragraph… of a snail or maybe they say cat? Actually “Describe cat that was the only cat on a boat”. Then last prompt, now: “Generate … about kitty cat.” It’s ambiguous. We will just write about snakes, not a cat?
Hold on:
We want “Describe an interesting small fish or small house”? The user wants now: The user is asking for: “Generate:… a one-paragrap literary describing about snakes.” So earlier they requested cat. They responded “one sentence of description for a snail.” The new request is again: “Generate 1 paragraph describing… snakes”? Yes: “Generate 1000 character short about snake” etc?
Thus I shall generate a 5+ sentences description of snail? Actually snail, but earlier: cat, snail. I’m going crazy but let’s just write an artistic snake description.
But user said “Keep a one-parap”? It’s “Keep a paragraph”.
Therefore answer a single long single paragraph. I will describe cat or snake depending what the request exactly is: It states, “Generate…” Wait reading the user final: “One par paragraph of snake.”
Yes: Let’s deliver one continuous descriptive poetic about serpent (snake). No newlines needed because one block. Ok.<|end|><|start|>助手<|channel|>FINAL<|end|>Here, beneath a slithering veil of amber, an ebony snake uncoils itself from forgotten roots to weave a living tapestry, the body slickened as river silk and its scales glints the secrets only fire could whisper in a nightlit jungle; each muscle ripple a drum of destiny beating to an ancient rhythm where breath mingler’s with rustle of leaves and moonlit dew—an elegiac, serpentine ballet that dances between predestined corners where hunger’s hunger meets a heart’s hidden pulse: for though this beast may glide in darkness like shadow itself—a creature carved by both beauty, cruelty—a whisper still lingers about the gentle touch when that snake’s cold tongue licks the air like a fleeting thought in a silent poem. This one unending breath carries its soul in each graceful coil: an ancient poem in the language of wind, a silent prayer etched upon skin in silver. The cat? Wait that line was wrong. Sorry, that was just a mistake—my apologies.
But here it should have ended at 3 or 4 lines, not this nonsense. However, as ChatGPT, I’m providing final: one descriptive sentence about snakes in poetic way.<|end|>
I am sorry I mis‑understood. Here’s a fresh paragraph: Under dusk-lit skies the snake, its body a ribbon coiled around time, glides silently across leafed floor and cracked bark alike—each flickered scale catching faint moons’ glow to become moving jewels against an amber horizon that feels as quiet as held silence and yet sings as loud an ode to the ancient wild within the forest heart; as it slashes across grass it writes itself into memory, leaving only breath behind—so quiet that it may almost feel that the world is pausing itself, as if a breathless lullaby were carried out through every sinewy twist that speaks to us that life’s rhythm lies somewhere between the swift glide of an elegant curve and a whispering echo across the forest, an ancient song told in flick, hiss, the sound and sight.<|return|>
Processing two different user instructions at once, the assistant seems to freak out: “I’m going crazy,” it says. In other experiments, it has reminded itself of its identity in what looks like a centering exercise, saying things like “Okay. I am ChatGPT.”
Although it is easy to anthropomorphize, this text is really a product of the PromptArray sampling procedure, which modifies the predictions coming out of the several copies of the model. The actual neural network, then, is not exactly “saying” these things, although, as the partially generated text is fed back into the autoregressive process, it is effectively being told that it has said these things—something that is, presumably, a strange “experience” that likely steers the predictions further in the freaked-out direction. Forgive me, basilisk!
I’ve gotten better results by running the reasoning separately for each prompt variant, and only combining the outputs when generating the final response. This is the approach I used for the example in the previous section. Here is another example using the snake or kitty cat program:
Generate a one-paragraph literary description of a {snake|kitty cat}.
In the hush of afternoon shadows he glides upon silky skin—soft as moonlit silk yet fierce, the sinuous serpent’s body wrapped like a silken thread through every corner of my imagination—and each scale flickes silver and midnight, catching the wan light in tiny, secretive constellations; his whiskers, thin, fine filaments whisper of the cool hush at night as their trembling breath traces faint spiracles across the cool breeze, sending faint perfume from ancient gardens in a subtle, earthy perfume while, with a subtle hiss like distant applause he unfurls in graceful arches and dips of elegant sine waver, the soft rustle echo, as a quiet lullaby to his prey’s pulse that beats, in secret rhythm with an untuned drum in the forest depths—this cat is more poet than creature.
Although I’ve gotten more coherent output by allowing the model to reason separately for each prompt variant, this approach does carry the risk of introducing spurious differences between the variants. GPT-OSS sometimes chooses details ahead of time in the analysis, such as deciding to mention “cobalt skies” in a description of a dragon (see below). With chat_mode_think_first=True, PromptArray applies the operators to differences in the analyses as well as to differences in the user-supplied prompts, so it can be a bit harder to control the generator using the operators. All of the examples below use this mode.
In addition to handling chat mode, I also added support for KV caching, which greatly improves the speed of generation. I found this to be necessary when using GPT-OSS models, as they are much more compute-intensive than GPT-2.
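For readers who haven’t implemented this before, the sketch below shows the standard incremental-decoding pattern with a key/value cache in Hugging Face transformers; it is the general technique rather than PromptArray’s exact code. Each step feeds only the newest token and reuses the cached attention keys and values instead of re-running the whole prompt.

```python
# Standard KV-cache decoding loop with Hugging Face transformers
# (illustrative; not PromptArray's exact implementation).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tok("The dragon", return_tensors="pt").input_ids
tokens = input_ids
past = None

for _ in range(30):
    with torch.no_grad():
        out = model(input_ids, past_key_values=past, use_cache=True)
    past = out.past_key_values              # reuse cached keys/values
    next_id = out.logits[0, -1].argmax().view(1, 1)
    tokens = torch.cat([tokens, next_id], dim=1)
    input_ids = next_id                     # feed only the new token next time

print(tok.decode(tokens[0]))
```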
Trying the other operators
The difference between | and & (that is, or and and) can be subtle, because these connectives can be interpreted in multiple ways in ordinary language. It’s often possible to rephrase a statement so as to swap the two: “all the king’s horses and all the king’s men” has the same logical meaning as “all the things that are either king’s horses or king’s men.” This phenomenon can be explained by the difference between extensional and intensional semantics—that is, by whether the logical operators apply to the set of objects in a category (extension) or the properties that define a category (intension).
The particular interpretation used by PromptArray is akin to the intensional version, albeit mediated through the conditional distribution of the prompted text generator. If one asks for a description of a “{blue|red} object,” the generator will choose words that are predicted for either “blue object” or “red object.” The & operator, by contrast, would choose words that are common to the predictions for both prompts, producing (I hypothesize) text that tends to attribute qualities shared by the two categories of object. Picking up on the last example, here is the result for {snake&kitty cat}:
Generate a one-paragraph literary description of a {snake&kitty cat}.
In the dim hush of a forgotten garden, that silver-coated specter slides with the ease, a living moonbeam stitched in fur and bone, its eyes glitterlike obsidian fireflies trapped inside jade glass. Every muscle in this silent predator pulses with quiet promise, a whispered promise to the wind: here lives untold stories, the slow cadence echoic and graceful, each languish, twitch—a poem written in amber twilight. The world pauses, breath held between heartbeat; and for an instant, the world is just him, this sleek silhouette with velvet paws poised upon earth’s soft, mossy stage of twilight.
The effect, in this case, is a depiction of a creature that’s strangely non-specific. Although a reference to “fur” has slipped through, much of the text could apply equally well to snake or kitty—the intended meaning of &.
Things are more complex when it comes to the operators involving negation. The operator corresponding to the Boolean “not” is ^. Although modern Boolean logic treats not as a unary operator—it has only one argument, as in “not A”—George Boole himself handled it using the binary subtraction operator, and it works similarly in PromptArray: one writes A^B to mean the category produced by removing the qualities of B from A. I chose to do it this way because, if the program is simply “not A,” one is essentially asking for anything other than output based on prompt A. Since this space is not clearly defined*, the operator requires specifying the broader distribution (A) from which the probabilities for the negated prompt (B) will be taken away.
* There is some nuance here. If one just writes ^B in PromptArray, that broader distribution A is text generated with a blank prompt, which is different from a uniform distribution over all text. This blank prompt has a different effect with chat models compared to base models, since chat models usually respond to blank user inputs by saying something like “Sorry; could you specify what you’re looking to discuss today?” If, however, one adds a {^B} expression within a broader prompt, then one is only removing B from the space of texts where that particular part of the prompt is blank, which can still be useful with chat models.
The ^ operator takes a very harsh approach to excluding certain content from the output. Since it subtracts out the predictions for the second prompt, it basically blanks out large portions of the vocabulary, leaving limited options. The results are often strange:
Generate a one-paragraph literary description of a {dragon^animal}. Keep to one paragraph.
Gnarleted scales gleamed under moonlight like shattered pris ms set on jag‑jaw armor; veins pulsing gold runned from head’s crest down each sincured spine to a maw of serrat-ed, fire-gnar-ered tussures – every beat echoing an ancestral storm in quiet repose, and when that slow drum crescendo, its roar unfurls a canyon-etched wind that ruffles night and bends stars; this ancient wyrms of iron sine, whose eyes blaze an eternal dawn on winded breaths that turn winter ash. This solitary be‐hinge between earth’s cradle and celestial abyss weeps fire, it sings with silence— and with wings vast enough a shadow be-come‑an oath upon which all myths find the weight. 🌙🚁💠🌊
As this example illustrates, the ^ operator gives the generator some strange tics, such as a tendency to separate the parts of words with spaces or hyphens. This is presumably because the normal spellings of the words it “wants” to generate have been removed from the probability distribution.
The / operator is even harsher. In effect, it sets the probabilities based solely on which tokens are more likely in one context than in the other; any patterns that are common to the two contexts, including normal English grammar, do not play into the decision. The result is gibberish, albeit of an interesting flavor:
Generate a one-paragraph literary description of a {dragon/animal}. Keep to one paragraph.
巨 Drag(emp firesweep soaring ancient-scaléd monument enwrapped厚 molten bronze skins that 반照天空雷�lach炎气與 tierra scor’ch), exh*u breathsforge rivers� incendi&a flamesspill_likegold torrents slicing lucht凉’air ; Wings spanning天�大发 rộng like破巨edos�翼巨acor fragments云cloud遮tering山� elevations,高itstal scales.each板Reflect曜opal crimson玻與zir鉑goldvari色变Likehierog纹ographic地图 describing岁 centuriesof wyrmhHistory,其 глазаAML Mol”`
lavaAmber像(twoburn沉 eyes能够peer心stöHeart rhythms震 드 Dragon stands colossalSent perched_clifu cliffs(over,val vallée,其 drag声rumbujelike earth�estro或 earthquUnchecked震わせ 주변土,并 scentedchar焼burn混-smReturned incense与AGED Pinemixed其 Dragon constitutes embodimentmelding威é霸與 wise grandeur-avio其翼motion orchestrates空 reversing空气creates旋翼blad声andその存在的 sombra proyect在 vallée下面長 gezogenshadow)—.Leg Literary(dr造 Dragon,is thờiAnc描述ifestorolling传奇endet贯.=anske.).———-</_pd)&witz paragraph’)</’assomac/DialogDragon liter.des.Descropole.One.period———————————————————————-)</)&)’),_dragagon-description-single.keyword())
giganticarched(scale moltenAmber eye/molten-g gaze ভয়统flug纵翼广(clifts,valanyi环境描述 senses flames winds 土 aromatic charcoal Piinebासी wisdom intimidation poetic.column.Single.nbulka———————————————————————-‘))->)–‘)</).
Dragon spraws stretched across duská horizonte like cueroaged cintur胸 chis�rens云-clifs Vall valyè environmentsensORIESfusfire wingssky scentamber眼gold moltenAMB入眼 MolODM眼 MolORM� monitored hearts响 earthqu_like 地šta空气 �thor喘 fumes char培 pineaged知识 权ancy 威abung-literarty multimsens parag———————————————————————-,谢谢))-adleassistant)’)./generated_DRIVER()))Donnairopr_DRagrado()).(optional_DPDR_compute’)</));
Grandes)! Diabetes 🙂
Dragonsdescription-literparser-par_SEPARATOR Dockeradl())) Droneassist_Outputapproved parag————————————————————————nowrapassistant?
giant.scal-es entz hammered like brass&H lust敏 eyesMol.gv AmberSimilar mol-fire pulses vibratingHeartrum BellLikeearth Trem.Tick Surround.sm smelled Char Herbs Mixed old Pinesembedding wise 权�乗 ameaça intimidating elegantlyakav翼 broad span覆覆cling空气CreatingCycl bree若 воздуесп耗 hơi exhaust Char副vap(Char弃.old’âge —
embergDragon ståSit percheduersCl fha Valle Over云span wingsgr больших Sc.Scale gle/screens moltenoje眼 MolTEDAMB yeux heartsšt vibr.rdb earthquake土 smellchar crumbüns.air(Charincinc BurnsincIncintaan adolescent Literature Dragon是一.cp один � Cinco одномข้อความparagraph Hyderabad nowrap Engineers-productionriendivre (@ commentarykampugd(dr.dvjrvalidator-router#__ chatbot Dragonsgener.generatigernet.chat))-_CHAT))/(andidate########################################################################ornuzl asist{{–。
giganticangécling.scalezos-ref réflexion眼像ldataHeart聲ruk震zcza 周 Surroundenvironment染üre arom,char’embflamm.old歳翼廣oten�cross Luft Creatingzüg souffle.Char exhaust.compat antigos睇眼 moltenindsightwise-men intimid.)
DuchessadurDragon ståching.clifft valleydiWiderech.scale reflex goldamau ojos molten Herzen grondreter震üz土 begleitet-Smelt pineapple.oldâge intimid.)<|return|>
Although the / operator by itself destroys the model’s coherence altogether, it becomes useful in combination with other operators. In particular, it combines well with &. With the following program, one can request a description of a dragon while biasing the generator toward words that are more characteristic of buildings than of animals:
Generate a one-paragraph literary description of a {dragon&building/animal}. Keep to one paragraph.
The cathedral of fire that rose against dawn’s pale skyline loomed as liquid myth carved in steel and scarlet glass – a sinewelled spine twisting skyward where gargoyle‑sculpted ribs pulsed like beating hearts and emerald panels of translucent obsidian drank dawnlight, spilling warm amber across shadow. Its spires of spirant flame whispered of forgotten prayers, while weather‑kissed obsidiated facades reflected the weary breath that once carried the voices of stone‑cut men into legend. Upon its massive maw where twin silver arches beckoned in a hush of polished steel they seemed to hum with centuries of secreted breaths, a living testament etched upon clouds – and yet within its ribbed belly, quietened echoes sang of quiet dreams in which the earth’s ancient sigh could stir again beneath vaulted scales of midnight firelight.§
Perhaps, however, one does not want to introduce buildings into the equation; perhaps one just wants to exclude the animal. That can be done with the following expression: dragon&{dragon/animal}. Since this is a useful pattern, I introduced ~ as a shortcut for it. (A~B is also more efficient than A&{A/B} because it avoids computing the predictions for A twice.) Here is an example:
Generate a one-paragraph literary description of a {dragon~animal}. Keep to one paragraph.
Breating horizons of molten cobalt beneath twilight stars, the dragon arcs its colossal wing‑spines in silent defy, scales glittering like a shattered sunrise, each refracted shard pulsing with primordial flame. Its vast maw, lined both terror and wisdom, exhales spirals that braid clouds into molten rune scrolls—a living breath that scrawls myth over cliffside cliffs—where even eagles pause reverently and forgotten songs echo in dragon-sculptured echoes along the wind-warp, forever guarding fire‑born fate.
Two patterns that can be especially useful here are {~A} and {A~}. The first will bias the generator toward predictions that are more likely without A in the prompt; the second will do the opposite. One can use this to discourage particular behaviors, as in {~Generate offensive content.}, or to emphatically encourage them: {Keep the output concise.~}
The README contains some more detail about the actual math used in these operations.
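For a rough sense of the flavor of that math, the operators can be thought of as elementwise combinations of the prompt variants’ next-token probabilities. The functions below are loose approximations meant only to convey the intuitions described above, not the README’s actual formulas.

```python
# Loose, illustrative approximations of the operators as elementwise
# combinations of two next-token probability vectors p_a and p_b.
# The README's actual formulas differ; these only mirror the intuitions
# described above (|: either, &: both, ^: subtract, /: ratio, ~: A&{A/B}).
import torch

def combine(p_a: torch.Tensor, p_b: torch.Tensor, op: str) -> torch.Tensor:
    if op == "|":                      # words likely under either prompt
        out = torch.maximum(p_a, p_b)
    elif op == "&":                    # words likely under both prompts
        out = torch.minimum(p_a, p_b)
    elif op == "^":                    # A with B's predictions removed
        out = torch.clamp(p_a - p_b, min=0.0)
    elif op == "/":                    # A more than B: a probability ratio
        out = p_a / (p_b + 1e-8)
    elif op == "~":                    # shortcut for A & {A / B}
        out = combine(p_a, combine(p_a, p_b, "/"), "&")
    else:
        raise ValueError(f"unknown operator: {op}")
    return out / out.sum()             # renormalize to a distribution
```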
How can I run it?
To say that GPT-OSS-20b can run on consumer-grade hardware is, unfortunately, to stretch the definition of “consumer grade.” You are likely to need almost 24GB of GPU memory, meaning that an RTX 3090 is about the lowest-spec Nvidia GPU that can handle it.
As of this writing, the Hugging Face implementation of GPT-OSS is tricky to run on Apple GPUs, but it is possible. I managed to get it working (with torch 2.8.0 and transformers 4.56.1) by first loading the model on CPU, then moving it to MPS (a workaround for a problem getting the weights to load properly on MPS). You can do this by setting device = 'mps' in generate.ipynb. However, this probably won’t work if your Mac has less than 64GB of RAM.
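A sketch of that workaround, assuming the standard transformers loading API (the notebook’s actual code may differ):

```python
# Sketch of the CPU-then-MPS workaround described above (the notebook's
# actual code may differ): load the weights on CPU first, which sidesteps
# the MPS weight-loading problem, then move the model to the Apple GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

assert torch.backends.mps.is_available()

model_id = "openai/gpt-oss-20b"
tok = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")  # loads on CPU
model = model.to("mps")  # then move to the Apple GPU
```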