Generative AI Basics: Creating Text, Images, and Audio from Patterns

Chapter 8

Estimated reading time: 12 minutes

What “Generative AI” Means in Practice

Generative AI is a type of AI that produces new content—such as text, images, or audio—by continuing patterns it has learned from many examples. Instead of only choosing from a fixed list of answers, it can create a fresh output each time: a paragraph, a picture, a melody, a voice reading, a summary, or a set of ideas.

A helpful way to think about it: generative AI is a “pattern completion engine.” You provide a starting point (a prompt), and the model predicts what should come next based on patterns it has learned. The output can look creative, but it is still grounded in learned patterns and probabilities.

Generative AI is commonly used for: drafting and rewriting text, brainstorming, translating and simplifying writing, generating images from descriptions, editing images, creating voiceovers, producing sound effects, and generating music-like audio. In many tools, a similar idea applies across media: the model encodes your prompt and then builds the output in small steps, each guided by patterns learned from examples, until it forms a complete result.

How Generative Models Create Text

Text generation models work with tokens. A token is a chunk of text (often a word or part of a word). When you type a prompt, the model converts it into tokens. Then it repeatedly estimates how likely each possible next token is given the context and picks one. After it picks a token, it adds it to the context and predicts the next one, continuing until it reaches a stopping point.
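The loop described above can be sketched in a few lines of Python. This is a toy illustration, not how any real model is built: the probability table is invented, and it looks only at the last token, whereas a real model learns its probabilities from data and considers the whole context. The names NEXT_TOKEN_PROBS and generate are hypothetical.

```python
# Toy next-token table: for each token, the possible next tokens and their
# probabilities. These values are invented; a real model learns them from data.
NEXT_TOKEN_PROBS = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"ran": 0.8, "sat": 0.2},
    "sat": {"<end>": 1.0},
    "ran": {"<end>": 1.0},
}


def generate(prompt_tokens, max_new_tokens=10):
    """Greedy generation: always append the most likely next token."""
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):
        options = NEXT_TOKEN_PROBS.get(context[-1])
        if options is None:
            break  # no learned pattern for this token
        next_token = max(options, key=options.get)
        if next_token == "<end>":
            break  # the stopping point
        context.append(next_token)  # the new token becomes part of the context
    return context


print(generate(["the"]))  # ['the', 'cat', 'sat']
```

Always taking the most likely token gives the same output every time. Tools that produce varied results pick among the likely tokens at random instead.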

This explains several behaviors beginners notice:

They can sound confident even when wrong: the model is optimized to produce plausible text, not guaranteed truth.
They can be sensitive to wording: small changes in your prompt change the context, which changes the most likely continuation.
They can be steered: you can guide style, format, and constraints by specifying them clearly.

Key knobs: randomness, length, and constraints

Most generative text tools expose settings (sometimes hidden) that affect output:

Temperature (randomness): lower values make outputs more predictable and consistent; higher values increase variety but can reduce reliability.
Max length: limits how long the output can be.
Stop sequences: tell the tool where to stop generating (for example, when the output reaches “###”).
System/role instructions (in some tools): higher-priority instructions that define behavior (tone, safety, format).

Even if you never touch these settings, you can simulate them with prompting: ask for “one concise paragraph” (shorter), or “give 10 varied options” (more variety), or “use this template exactly” (more constrained).
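To see why temperature changes variety, it can help to look at the arithmetic. A common way to apply temperature is to divide the model's raw scores by it before turning them into probabilities. The sketch below assumes that approach; the three scores are made up, and softmax_with_temperature is a hypothetical helper name.

```python
import math


def softmax_with_temperature(scores, temperature):
    """Turn raw scores into probabilities. Temperature must be greater than 0."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.5]  # invented scores for three candidate tokens
low = softmax_with_temperature(scores, 0.5)
high = softmax_with_temperature(scores, 2.0)
print([round(p, 2) for p in low])
print([round(p, 2) for p in high])
```

With these numbers, the top candidate's share drops from roughly 0.84 at temperature 0.5 to roughly 0.48 at temperature 2.0, so the less likely candidates get picked more often.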

How Generative Models Create Images

Image generation models create pictures from patterns in visual data. Many modern systems use a process that starts with noise (a random static-like image) and gradually transforms it into a coherent image that matches your prompt. This is often described as “denoising.”
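The "start from noise and refine" idea can be pictured with a very small stand-in. The sketch below is not a real image model: a real model has to predict, from the prompt, what to change at each step, while this toy is simply handed the finished picture (target) and walks toward it. The function name toy_denoise and its parameters are hypothetical.

```python
import random


def toy_denoise(target, steps=10, seed=0):
    """Start from random noise and close part of the gap to `target` each step.

    Stand-in only: the "answer" is passed in directly, which a real
    image model never has.
    """
    rng = random.Random(seed)
    image = [rng.random() for _ in target]  # pure noise, one value per pixel
    for step in range(steps):
        remaining = steps - step
        # Close 1/remaining of the gap, so the final step lands on the target.
        image = [px + (t - px) / remaining for px, t in zip(image, target)]
    return image


print(toy_denoise([0.0, 0.5, 1.0], steps=5))
```

The useful takeaway is the shape of the process: many small corrections, each starting from the previous result.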

In simple terms: the model learns how images relate to text descriptions. When you provide a prompt like “a realistic photo of a red bicycle leaning against a brick wall at sunset,” the model tries to produce an image whose visual patterns match that description: shapes, colors, lighting, textures, and composition.

Common image prompt ingredients

Image prompts usually work best when you include:

Subject: what is in the image (person, object, scene).
Style: realistic photo, watercolor, 3D render, pencil sketch, etc.
Composition: close-up, wide shot, top-down, portrait orientation.
Lighting: soft daylight, studio lighting, golden hour, dramatic shadows.
Details: materials, colors, background elements.

Many tools also support negative prompts (what you do not want), reference images, or editing modes (inpainting/outpainting) where you modify part of an existing image.
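If you write many image prompts, you can keep the ingredients separate and join them at the end. The following is a minimal sketch; build_image_prompt is a hypothetical helper, and it folds unwanted items into the prompt as "no ..." phrases, which is an assumption. Tools with a dedicated negative prompt field would take those separately.

```python
def build_image_prompt(subject, style, composition, lighting, details=(), avoid=()):
    """Join the prompt ingredients into one comma-separated description."""
    parts = [style, composition, subject, lighting, *details]
    parts += [f"no {item}" for item in avoid]
    return ", ".join(part for part in parts if part)


prompt = build_image_prompt(
    subject="a red bicycle leaning against a brick wall",
    style="realistic photo",
    composition="wide shot",
    lighting="golden hour",
    details=["warm colors"],
    avoid=["text"],
)
print(prompt)
# realistic photo, wide shot, a red bicycle leaning against a brick wall, golden hour, warm colors, no text
```

Keeping the ingredients as separate arguments makes it easy to change one of them and regenerate.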

How Generative Models Create Audio

Audio generation can mean different things:

Text-to-speech (TTS): turning text into spoken audio.
Voice conversion or voice cloning (where allowed): changing a voice style while keeping the words.
Sound effects generation: creating sounds like rain, footsteps, or a door closing.
Music generation: producing instrumental tracks or melodies.

Like text and images, audio is represented as patterns—such as waveforms or spectrogram-like representations—and the model generates new audio that matches the prompt and constraints (duration, mood, tempo, voice style, background noise level, and so on).

For beginners, the most practical starting point is text-to-speech: you write a script, choose a voice style, and generate a clean voiceover. The same prompting principles apply: specify tone, pacing, pronunciation hints, and formatting.

Prompting Basics: Getting Useful Outputs Reliably

A prompt is not just a question. It is an instruction plus context. The more clearly you define the task, the more likely you get a usable result on the first try.

A simple prompt formula

Goal: what you want.
Context: who it’s for, what it’s about, what constraints exist.
Format: bullet list, table, JSON, step-by-step, script, etc.
Quality bar: tone, reading level, length, do/don’t rules.

Example prompt for text:

Task: Draft a customer support reply. Context: The customer reports a delayed shipment for order #4821. Constraints: Apologize, give next steps, do not promise a delivery date. Format: 1 short paragraph + 3 bullet points. Tone: calm, professional, friendly.
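The four-part formula can also be turned into a small checklist in code, so that a prompt is only assembled once every part is filled in. This is a sketch under the assumption that you are preparing prompts in Python; build_prompt is a hypothetical helper, not part of any tool.

```python
def build_prompt(goal, context, output_format, quality_bar):
    """Assemble the four formula parts into labeled lines, refusing empty parts."""
    parts = {
        "Goal": goal,
        "Context": context,
        "Format": output_format,
        "Quality bar": quality_bar,
    }
    missing = [label for label, text in parts.items() if not text.strip()]
    if missing:
        raise ValueError(f"fill in before prompting: {', '.join(missing)}")
    return "\n".join(f"{label}: {text.strip()}" for label, text in parts.items())


print(build_prompt(
    goal="Draft a customer support reply.",
    context="The customer reports a delayed shipment for order #4821.",
    output_format="1 short paragraph + 3 bullet points.",
    quality_bar="Calm, professional, friendly. Do not promise a delivery date.",
))
```

The error for a missing part is the point of the design: it catches a vague prompt before the model has to guess.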

Example prompt for images:

Realistic photo, close-up of a ceramic coffee cup on a wooden table, morning window light, shallow depth of field, soft shadows, warm color palette, cozy kitchen background, no text, no watermark.

Example prompt for TTS:

Read the following script as a friendly instructor. Pace: medium-slow. Tone: encouraging and clear. Pronounce “SQL” as “sequel.” Avoid dramatic emphasis. Script: [paste script]

Iterate: treat the model like a collaborator

Generative AI often works best in short cycles:

Generate a draft.
Review and mark what is wrong or missing.
Ask for a revision with specific changes.
Repeat until it meets your needs.

Instead of saying “make it better,” say “reduce the length by 30%, keep the same structure, replace jargon with simpler words, and add one concrete example.”

Practical Step-by-Step: Creating Text You Can Actually Use

Step 1: Define the output and audience

Before prompting, decide what “done” looks like. For example: “A 200-word product description for beginners, written in plain language, with 3 bullet benefits and a short call-to-action.” This prevents the model from guessing your preferences.

Step 2: Provide the minimum necessary facts

Generative AI can fill gaps with plausible-sounding details. To reduce mistakes, supply key facts explicitly. For a product description, include: product name, target user, main features, constraints (no medical claims), and any required phrases.

Step 3: Ask for a structured draft

Structure makes outputs easier to verify and edit. Example:

Write a 200-word description for “BreezeBottle,” a reusable water bottle for office workers. Include: 3 bullet benefits, materials (stainless steel), capacity (750 ml), and a short call-to-action. Avoid health claims. Use simple language.

Step 4: Verify and correct

Check for factual accuracy, missing constraints, and tone. If it invented a feature (for example, “keeps water cold for 48 hours” when you never said that), correct it and ask for a rewrite:

Rewrite the description. Do not mention insulation performance because it is unknown. Keep the same structure and length.
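Part of this check can be automated. One rough approach, sketched below, is to flag any number in the draft that does not appear in the facts you supplied. It only catches numeric claims, so it supports a careful read rather than replacing one. The helper unsupported_numbers and the sample draft are hypothetical.

```python
import re

NUMBER = r"\d+(?:\.\d+)?"  # whole numbers and simple decimals


def unsupported_numbers(draft, facts):
    """Return numbers that appear in the draft but in none of the supplied facts."""
    known = set(re.findall(NUMBER, " ".join(facts)))
    found = set(re.findall(NUMBER, draft))
    return sorted(found - known, key=float)


facts = ["BreezeBottle", "stainless steel", "750 ml", "3 bullet benefits"]
draft = "BreezeBottle holds 750 ml and keeps water cold for 48 hours."
print(unsupported_numbers(draft, facts))  # ['48']
```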

Step 5: Create variations for different channels

Once you have a correct base version, ask for channel-specific versions:

Website version (longer, more detailed)
Social post (short, punchy)
Email snippet (friendly, direct)
FAQ (question/answer format)

This is a strong use of generative AI: reformatting and adapting content while keeping the same facts.

Practical Step-by-Step: Generating Images from a Prompt

Step 1: Decide what the image is for

Is it a blog header, a product mockup, a background image, or an illustration for a lesson? The purpose determines composition, aspect ratio, and detail level.

Step 2: Write a clear prompt with visual specifics

Start with a simple prompt, then add details. Example for a course illustration:

Realistic photo of a person’s hands typing on a laptop at a desk, soft daylight from a window, notebook and pen nearby, minimal modern workspace, shallow depth of field, natural colors, no text.

Step 3: Generate multiple options

Image generation is inherently variable. Generate several candidates and choose the best composition. If the tool supports it, keep the same “seed” to make controlled changes.
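The effect of a seed is easy to demonstrate with Python's standard random module. The sketch below is an analogy, not the code of any image tool: starting_noise is a hypothetical function that stands in for the random starting point a generator would use.

```python
import random


def starting_noise(seed, size=4):
    """Return the same pseudo-random values every time for a given seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(size)]


same_a = starting_noise(42)
same_b = starting_noise(42)
other = starting_noise(43)

print(same_a == same_b)  # True: same seed, same starting point
print(same_a == other)   # False: a different seed gives a different start
```

This is why fixing the seed lets you change one word of the prompt and attribute the difference to that word rather than to chance.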

Step 4: Refine with targeted edits

Instead of rewriting the whole prompt, adjust one variable at a time:

Too dark: “brighter exposure, softer shadows.”
Too busy: “clean background, fewer objects.”
Wrong vibe: “more professional, neutral color palette.”

If the tool supports inpainting, you can fix small issues (like an odd object) without regenerating everything.

Step 5: Check for practical usability

For real projects, verify:

Does it match your brand style (colors, mood)?
Is there space for UI elements (if used as a header)?
Are there visual artifacts that look unnatural?
Are there any unintended details that could confuse viewers?

Practical Step-by-Step: Creating Audio (Voiceover) from Text

Step 1: Write for listening, not reading

Audio scripts should be simpler than written articles. Use shorter sentences, clear transitions, and occasional repetition of key terms. Numbers and abbreviations should be written in a way that is easy to speak.

Step 2: Add performance directions

Many TTS tools respond well to instructions like pace, tone, and emphasis. Example:

Voice: warm and clear. Pace: medium. Add a short pause after each bullet point. Avoid sounding excited. Script: ...

Step 3: Generate a short test clip

Start with 10–20 seconds. Listen for pronunciation issues, unnatural pacing, or awkward emphasis. Fix the script (often easier than trying to force the voice to compensate).

Step 4: Fix pronunciation and rhythm

Common fixes include:

Spell out tricky names and acronyms phonetically (for example, “EYE-oh-tee” for “IoT”).
Replace symbols like “/” with words like “per” or “slash.”
Break long sentences into two.
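If you apply the same fixes to every script, a small replacement table can do it for you. The sketch below is one simple way to do this; the table SPOKEN_FORMS and the function prepare_for_speech are hypothetical. It uses plain text replacement, so it would also change "/" inside a web address or "IoT" inside a longer word. Check the result before generating audio.

```python
import re

# Hypothetical replacement table; extend it with the terms in your own script.
SPOKEN_FORMS = {
    "IoT": "EYE-oh-tee",
    "SQL": "sequel",
    "/": " per ",
}


def prepare_for_speech(script, replacements=SPOKEN_FORMS):
    """Swap hard-to-pronounce written forms for their spoken equivalents."""
    for written, spoken in replacements.items():
        script = script.replace(written, spoken)
    return re.sub(r"\s+", " ", script).strip()  # collapse doubled spaces


print(prepare_for_speech("IoT sensors report 5 readings/second."))
# EYE-oh-tee sensors report 5 readings per second.
```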

Step 5: Produce the full audio and export cleanly

Generate the full voiceover, then check for consistency across sections. If you need multiple clips (intro, lesson, outro), keep the same voice settings and script style so the audio feels uniform.

Important Limitations: Where Generative AI Can Mislead You

Hallucinations (made-up details)

Generative models can produce information that sounds correct but is not. This is especially common when you ask for specific facts, citations, or niche technical details without providing sources. A practical habit: treat generated factual claims as “draft notes” until you verify them.

Ambiguity and hidden assumptions

If your prompt is vague, the model fills in blanks. For example, “Write a policy for remote work” could assume a certain country, legal environment, company size, or job type. Reduce ambiguity by stating assumptions explicitly: “for a 20-person software company,” “for contractors,” “non-legal advice,” “focus on communication and security basics.”

Inconsistent formatting

If you need strict structure (for example, JSON, a checklist, or a template), you must specify it and enforce it. Ask the model to output only the format you want, and consider adding a validation step: “Before finalizing, check that you included exactly 5 bullets and no extra sections.”
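Asking the model to check itself helps, but a check outside the model is more dependable when the format really matters. The sketch below shows two such checks, assuming you have the model's reply as a Python string. The function names, the "- " bullet marker, and the example keys are assumptions.

```python
import json


def has_exact_bullets(text, expected=5, marker="- "):
    """True if the text is exactly `expected` bullet lines and nothing else."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return len(lines) == expected and all(line.startswith(marker) for line in lines)


def parse_json_object(reply, required_keys):
    """Return the parsed object, or raise ValueError describing the problem."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError as err:
        raise ValueError(f"not valid JSON: {err}") from err
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in required_keys if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data


print(has_exact_bullets("- a\n- b\n- c\n- d\n- e"))               # True
print(parse_json_object('{"title": "Hi", "bullets": []}', ["title", "bullets"]))
```

When a check fails, you can send the error message back in the next prompt and ask for a corrected version.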

Copyright and brand risks (practical viewpoint)

In real work, you should avoid prompting the model to imitate a specific living artist’s style or to reproduce recognizable branded characters and logos. For business use, prefer prompts that describe general styles (for example, “realistic studio photo,” “minimal flat illustration”) and keep your own brand assets separate and controlled.

Core Terms You’ll See in Generative AI Tools

Prompt

The input instruction or description you give the model. Prompts can include context, examples, formatting rules, and constraints.

Completion

The generated output that continues from your prompt. In chat tools, each assistant message is a completion.

Context window

The amount of text, measured in tokens, that the model can consider at once. If your conversation or document is too long, earlier details may be forgotten or summarized away. A practical workaround is to restate key constraints near the point where they matter.
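The workaround can be sketched as a small function that keeps the most recent messages that fit a budget and then restates the key constraints at the end. Everything here is an assumption for illustration: fit_to_window is a hypothetical helper, and it counts whitespace-separated words as tokens, while real tools use their own tokenizers and will report different counts.

```python
def fit_to_window(messages, max_tokens, restated=()):
    """Keep the newest messages that fit, then append the restated constraints.

    Tokens are approximated as whitespace-separated words.
    """
    def count(text):
        return len(text.split())

    budget = max_tokens - sum(count(r) for r in restated)
    kept = []
    for message in reversed(messages):  # walk from newest to oldest
        cost = count(message)
        if cost > budget:
            break  # this message and everything older is dropped
        kept.append(message)
        budget -= cost
    return kept[::-1] + list(restated)


history = ["a b c", "d e", "f g h i"]
print(fit_to_window(history, max_tokens=8, restated=["keep it short"]))
# ['f g h i', 'keep it short']
```

Reserving room for the constraints first means they survive even when older messages are cut.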

Tokens

Chunks of text used internally by text models. Token limits affect how much you can input and how long the output can be.

Temperature

A setting that controls randomness. Lower temperature tends to produce more consistent, conservative outputs; higher temperature increases variety.

Seed (common in image generation)

A number that helps reproduce the same random starting point. Keeping the same seed can make results more consistent while you tweak the prompt.

Guidance / prompt strength (common in image generation)

A control that affects how strongly the model follows the prompt versus generating more freely. Higher guidance usually follows the prompt more closely but can reduce naturalness.

Upscaling

Increasing image resolution while trying to preserve detail. Some tools also “enhance” details during upscaling.

Inpainting / outpainting

Editing part of an image (inpainting) or expanding beyond the borders (outpainting) using generative methods.

Text-to-speech (TTS)

Generating spoken audio from text. Often includes controls for voice, tone, pacing, and sometimes emotion.

Mini Workflows You Can Reuse (Text, Image, Audio)

Workflow A: Turn rough notes into a polished explanation

Use this when you have messy ideas and want a clear paragraph.

Here are my notes: [paste notes]. Task: Rewrite as a clear explanation for beginners. Constraints: keep all facts, remove repetition, add one simple example, 150–200 words, friendly tone.

Workflow B: Create an image brief before generating images

Instead of jumping straight to image generation, ask for a brief you can approve.

Create 3 image concepts for a lesson about generative AI creating text, images, and audio. For each concept, provide: subject, setting, composition, lighting, and a final image prompt. Constraints: realistic, no text in the image.

Workflow C: Script → voiceover with quality checks

Use this when you need audio that sounds natural.

Task: Convert the following lesson text into a voiceover script. Constraints: short sentences, add stage directions in brackets for pauses, keep under 2 minutes. Then list 5 words that might be mispronounced and suggest phonetic spellings.
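To check a time limit like "under 2 minutes" before generating any audio, you can estimate duration from the word count. The sketch below assumes a speaking rate of 150 words per minute, which is only a rough figure. Measure a short test clip from your own tool and adjust words_per_minute to match. The function name estimated_seconds is hypothetical.

```python
def estimated_seconds(script, words_per_minute=150):
    """Rough spoken duration of a script, based only on its word count."""
    words = len(script.split())
    return words * 60 / words_per_minute


sample = "word " * 300  # a 300-word placeholder script
print(estimated_seconds(sample))  # 120.0
```

Pauses and stage directions add time that this estimate does not count, so leave some margin.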

Now answer the exercise about the content:

Which prompt revision best reduces the risk of the model inventing product details while keeping the output easy to check?

Providing minimum necessary facts, adding constraints, and requesting a structured draft helps prevent plausible-sounding made-up details and makes verification and correction easier.

AI Fundamentals for Absolute Beginners: Concepts, Use Cases, and Key Terms