Remember when SEO just meant stuffing keywords into a page? Those days are long gone. The digital landscape is shifting dramatically, and if you’re still focused solely on text-only SEO, you’re falling behind. Today’s search environment is vibrant, dynamic, and increasingly multimodal. Users aren’t just typing queries anymore. They’re speaking to devices, snapping pictures of products, and expecting instant, rich responses that blend text, images, video, and interactive elements.
This fundamental shift is reshaping how content gets discovered, and smart marketers are already adapting their strategies.
Let’s explore why text-only SEO is shrinking and why embracing multimodal search isn’t optional anymore.
Project Astra = Google’s Declaration of War on “Typed Search”
Search is evolving from something you do into something that happens around you.
Think about how you searched for information a decade ago. You’d open a browser, navigate to Google, and deliberately type words into a box. That model is rapidly becoming outdated.
Search is transforming from an intentional activity into an ambient capability that surrounds us, anticipating needs before we even articulate them. Google’s latest AI advancements go far beyond simple keyword matching, creating systems that grasp context, entities, and conceptual relationships rather than just matching text patterns.
This shift presents a major challenge for businesses still clinging to traditional text optimization techniques. In this new world, pure text SEO simply isn’t enough anymore. The algorithms now reward helpful, well-structured information while penalizing outdated keyword tricks.
Project Astra = Google’s real-time, multimodal AI agent.
Project Astra represents Google DeepMind’s bold vision for what search becomes when it’s no longer constrained to text. This universal AI assistant processes multiple input formats simultaneously and responds intelligently across modalities.
Unlike traditional Google Search, Astra:
- Sees real-world objects through your device’s camera. Point your phone at a plant, and it instantly identifies the species and offers care tips.
- Hears and understands natural conversations, detecting different accents, languages, and even emotional tones.
- Remembers what you’ve shown it and what you’ve discussed previously, maintaining context between interactions.
- Responds with whatever format makes the most sense for your query, whether that’s text, voice, visuals, or some combination.
The implications are profound: typing queries into a search box will no longer be the primary mode of information discovery. Instead, we’re entering an era where search happens contextually around us, integrated seamlessly into our environments and activities. [AI Developer Code]
Google’s AI Search Mode = The Death of “Blue Links”
The search results page is no longer a list. It’s a conversation.
The familiar Google search results page with its “10 blue links” is undergoing a radical transformation. AI Mode in Search reimagines how we interact with information online. Rather than scanning through a static list of websites, users now engage in dynamic, conversational discovery.
This new approach uses what Google calls a “query fan-out” technique, breaking complex questions into subtopics for deeper investigation across the web. The result feels less like using a search engine and more like consulting a knowledgeable assistant who understands the nuances of your questions. [Google]
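Conceptually, fan-out works like a researcher delegating sub-questions and then writing a single synthesis. The sketch below is a hypothetical illustration of that pattern, not Google’s actual implementation; `decompose_query` and `search_web` are stand-ins for an LLM call and a retrieval API.

```python
# Hypothetical illustration of a "query fan-out" pattern (not Google's implementation).

def decompose_query(question: str) -> list[str]:
    """Stand-in for an LLM call that splits a complex question into subtopics."""
    # e.g. "best laptop for video editing under $1500" might fan out to:
    return [
        "laptop GPU requirements for video editing",
        "best laptops under $1500",
        "color-accurate laptop displays",
    ]

def search_web(sub_query: str) -> list[str]:
    """Stand-in for a retrieval call that returns candidate passages."""
    return [f"passage about: {sub_query}"]

def answer(question: str) -> str:
    """Fan out, gather evidence per subtopic, then synthesize one response."""
    evidence = []
    for sub_query in decompose_query(question):
        evidence.extend(search_web(sub_query))
    # A generative model would blend these passages into one conversational answer.
    return " / ".join(evidence)

print(answer("What's the best laptop for video editing under $1500?"))
```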
For users, this creates a more intuitive experience that mirrors how we naturally seek information. For content creators, it means rethinking how information is structured and presented to align with these conversation-based discovery flows.
What replaces the “10 blue links” model?
Google’s AI Overview fundamentally changes how search results are delivered. Instead of a sequential list of websites to visit, users now receive a rich, integrated surface that blends text summaries, relevant images, video clips, shopping links, and contextual information.
This transformation reflects a critical insight: users don’t primarily want websites; they want answers. People are seeking solutions to problems and guidance for decisions, not a list of pages to click through.
This creates an urgent imperative for content creators. If your information doesn’t exist as clear, extractable insights that AI can understand and incorporate into these blended results, it risks being skipped entirely. The focus shifts from optimizing whole pages to ensuring your content contains valuable nuggets that AI systems can easily identify and surface.
The Fragmentation of Discovery Surfaces
Search now lives in hundreds of micro-engines:
Traditional SEO operated on a single battlefield: Google’s Search Engine Results Pages (SERPs). Companies mastered keywords, backlinks, and on-page techniques to climb those rankings. But that singular focus is becoming increasingly outdated.
Today’s digital landscape features hundreds of “micro-engines” with their own algorithms, priorities, and formats. Text is just one piece of the puzzle, and often not the most important one. People don’t just read anymore. They speak, scan, swipe, watch, and snap. [Mavlers]
If your content strategy doesn’t speak multiple “languages” across visual, video, semantic, and contextual dimensions, then search engines increasingly won’t speak to you either. This fragmentation demands a completely different approach to visibility and audience connection.
Fractured channels of specialized discovery
The discovery landscape has fractured into specialized channels:
- AI Assistants: Gemini, ChatGPT, Microsoft Copilot, and Perplexity are becoming primary information gateways, generating conversational responses that synthesize information from multiple sources.
- Social Search: Platforms like TikTok, Instagram, and Reddit have evolved into powerful search engines. Many younger users bypass Google entirely, heading straight to these platforms for information discovery.
- Vertical Engines: Amazon for products, YouTube for videos, and Pinterest Lens for visual discovery each represent massive search ecosystems with unique ranking factors.
- Ambient Search: Project Astra, AR search capabilities, and wearable technologies create interfaces where search happens contextually in our physical environment.
Each of these channels requires specific optimization strategies that go well beyond traditional text-based SEO. [Search Engine Land]
Discovery is no longer centralized. It’s fragmented and multimodal.
This fragmentation isn’t just a temporary shift in user behavior. It’s a fundamental paradigm change in how information discovery works. Google’s declining share of overall search activity isn’t just a blip; it signals a permanent transformation in the discovery ecosystem.
The new reality is that discovery is simultaneously becoming more fragmented across platforms and more multimodal in nature. This requires a completely different approach to content creation and optimization, one that embraces multiple formats and discovery channels rather than focusing exclusively on text.
Intent Is No Longer Text-Bound
People don’t just search. They show, ask, speak, point, record, draw.
User intent used to be expressed primarily through typed words. But human curiosity and information needs are far more complex than what can be easily typed into a search box. Today’s technology is finally catching up to the full spectrum of how people naturally express their questions and needs.
People now have multiple ways to indicate what they’re looking for:
- Taking a photo of an unknown object to identify it
- Speaking naturally to a voice assistant
- Drawing a sketch to find similar images
- Recording a snippet of music to discover the song
- Pointing their phone at a landmark to learn its history
This multidimensional expression of intent creates both challenges and opportunities for content creators who have traditionally focused solely on text optimization.
Search intent now manifests in formats beyond text
Intent has broken free from text-only constraints and now flows through multiple channels:
- Visual intent: Instead of typing “what is this plant in my garden,” users can simply take a photo and Google Lens will identify it. This visual expression of intent bypasses text entirely while delivering more precise results.
- Conversational intent: Rather than constructing keyword-optimized queries, users can ask AI assistants questions in natural language, complete with follow-ups and clarifications.
- Experiential intent: With AR overlays via systems like Project Astra, users can literally “see” answers in their physical context, pointing their device at objects to reveal information layers.
Algorithms are format-agnostic
Modern AI systems are increasingly format-agnostic. They don’t fundamentally care whether your question comes as text, an image, voice, or some combination. These algorithms determine the most appropriate response based on context rather than treating different input formats as entirely separate systems.
This represents a significant shift in how search engines work. Instead of having separate algorithms for text, voice, and image search, unified AI models now handle all these modalities simultaneously, choosing the best response format based on what will most effectively answer the user’s need.
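To make that concrete, here is a minimal sketch of a single multimodal request using the google-generativeai Python SDK: one call carries both a photo and a plain-language question, and the model decides how to answer. The model name, API key, and file path are placeholders.

```python
# Minimal sketch: one request mixing an image and natural-language text.
# Assumes the google-generativeai SDK and Pillow are installed; values are placeholders.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

photo = Image.open("mystery_plant.jpg")  # the "visual intent": a snapshot instead of a typed query
response = model.generate_content(
    [photo, "What plant is this, and how often should I water it?"]
)
print(response.text)
```

The same entry point accepts text alone or several images mixed with text; the input format is simply another part of the prompt rather than a separate search product.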
Key takeaway: If your answers exist only in text, you’ll miss where future intent lives.
The crucial insight for businesses and content creators is that limiting your presence to text-only formats means missing vast swaths of user intent that now exists in other modalities.
If your content strategy doesn’t account for visual search, voice interaction, and multimedia discovery, you’re essentially invisible to a growing percentage of search activity. As multimodal interfaces become more prevalent, this blind spot will only grow larger for text-only strategies.
From Static Pages to Multimodal Knowledge Assets
Text-based blogs once worked because search engines could only read. But Gemini, Astra, and Perplexity can now see, hear, and summarize.
Traditional SEO focused on optimizing text-based web pages because that’s all search engines could effectively process. Early algorithms were essentially text-matching systems with increasingly sophisticated relevance calculations.
But today’s AI-powered search systems like Google’s Gemini, Project Astra, and Perplexity represent a fundamental capability shift. These systems can:
- Process and understand images, diagrams, and video content
- Listen to and interpret audio information, including nuances of tone
- Synthesize information across formats to create comprehensive summaries
- Extract key insights from content regardless of its original format
This evolution means that text-only content, while still valuable, now exists in an environment where multimodal assets have significant advantages in terms of discoverability and utility.
Future-ready content needs:
Content that’s prepared for this multimodal future requires a more diverse approach:
- Text: Still essential for semantic richness and detailed explanation. Text provides the backbone of information but is no longer sufficient on its own.
- Video: Particularly short, dynamic snippets that deliver information efficiently. These don’t need to be highly produced; authentic, information-rich clips often perform better.
- Audio: Podcasts, voice summaries, and content optimized for voice assistants reach audiences who prefer listening to reading.
- Structured metadata: This invisible layer helps AI systems understand and extract information from your content. Proper schema markup becomes increasingly important as AI decides what information to include in synthesized answers (a minimal sketch follows this list).
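As a concrete example of that invisible layer, the snippet below assembles schema.org VideoObject markup as JSON-LD. It is written in Python only to keep the sketches in this post consistent, and every URL and value is a placeholder.

```python
# Sketch: schema.org VideoObject markup emitted as JSON-LD (all values are placeholders).
import json

video_schema = {
    "@context": "https://schema.org",
    "@type": "VideoObject",
    "name": "How to repot a snake plant in 60 seconds",
    "description": "A short walkthrough of repotting a root-bound snake plant.",
    "thumbnailUrl": "https://example.com/thumbs/snake-plant.jpg",
    "uploadDate": "2024-05-01",
    "duration": "PT1M",  # ISO 8601 duration: one minute
    "contentUrl": "https://example.com/videos/snake-plant.mp4",
}

# Embed the output inside a <script type="application/ld+json"> tag on the page.
print(json.dumps(video_schema, indent=2))
```

The point of the markup is that crawlers and AI systems can read the clip’s key facts without parsing the surrounding page at all.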
The most effective strategies now treat these formats as integrated components of cohesive knowledge assets that can be discovered across multiple modalities.
Insight: The battle isn’t for rankings anymore. It’s for representation inside AI-generated answers.
This shift creates a fundamental change in how success is measured. The traditional focus on search rankings is becoming less relevant as AI-generated answers increasingly stand between users and websites.
The new priority is ensuring your content gets represented within these AI-synthesized responses. Even when your content doesn’t receive a direct click, having your information included in AI-generated answers creates visibility and establishes authority.
This requires a different optimization mindset than traditional keyword targeting. Content needs to be structured for easy extraction of key points, provide clear answers to common questions, and deliver unique insights that AI systems will recognize as valuable additions to their synthesized responses. [Search Engine Journal]
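One practical way to expose those clear answers is FAQ-style markup, where each common question is paired with a short, self-contained answer. The sketch below builds schema.org FAQPage JSON-LD in the same style as the earlier example; the questions and answers are invented placeholders.

```python
# Sketch: FAQPage markup pairing common questions with short, extractable answers.
import json

faq = [
    ("How often should I water a snake plant?",
     "Every two to three weeks, letting the soil dry out completely between waterings."),
    ("Does a snake plant need direct sunlight?",
     "No. It tolerates low light, though bright, indirect light keeps growth steady."),
]

faq_schema = {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [
        {
            "@type": "Question",
            "name": question,
            "acceptedAnswer": {"@type": "Answer", "text": answer},
        }
        for question, answer in faq
    ],
}

print(json.dumps(faq_schema, indent=2))
```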
Zero-Click Multimodal Search Will Dominate
SGE + Project Astra = answers before clicks.
The combination of Google’s Search Generative Experience (SGE) and Project Astra creates a powerful ecosystem where users receive comprehensive answers without ever needing to click through to websites. These AI search engines don’t simply display links; they actively synthesize videos, text, images, charts, and other content into unified, contextual outputs.
Google reports that people now use Lens for 12 billion visual searches monthly, a four-fold increase in just two years. [Google] This dramatic growth in visual search represents just one facet of the multimodal revolution transforming information discovery.
Astra takes this even further with its ability to understand and interact with the physical world through your camera lens. The system can identify context and respond accurately to what it sees, creating rich information experiences directly within the search interface. [Sify]
This integrated approach means users increasingly consume information directly within Google’s interface or through conversational AI, never actually visiting source websites.
Hot take: SEO teams obsessing over click-through rates are measuring the wrong metric. In a multimodal world, you optimize for answer inclusion, not page visits.
This evolution requires a significant shift in how we measure SEO success. Teams still fixated on click-through rates and website traffic as primary KPIs are essentially optimizing for a model that’s rapidly becoming outdated.
In a multimodal world dominated by AI-synthesized answers, the more relevant metrics revolve around answer inclusion rather than page visits. Success means having your information, data points, images, or video snippets incorporated into the AI-generated responses that users consume directly within search interfaces.
This doesn’t mean traffic becomes irrelevant. Deep engagement still happens on websites. But it does mean that visibility and influence can occur without traditional site visits, requiring new approaches to measurement that account for these “zero-click” impressions.
The Sensory Web Is Inevitable
“The web is evolving from a text-first index to a sensory-first network.”
The internet began as a text-based system, with images, audio and video gradually added as technologies evolved. But even as multimedia content proliferated, the underlying structure remained fundamentally text-centric. Links, metadata, and search algorithms primarily operated on text, with other media types as supplementary elements.
We’re now witnessing the emergence of what might be called the “Sensory Web,” a profound evolution where the internet’s structure itself becomes multimodal at its core. In this new paradigm, text is just one equal component alongside visual, audio, and interactive elements, all seamlessly integrated into a rich sensory network.
This transformation isn’t merely cosmetic or incremental. It represents a fundamental rewiring of how information is structured, discovered, and experienced online.
Multimodal AI = first glimpse of the Sensory Web:
The evolution of search follows a clear progression:
- Keywords era: Search engines matched text strings, requiring exact keyword optimization.
- Concepts era: Semantic search emerged, with algorithms understanding topics and relationships between ideas regardless of exact wording.
- Experiences era: We’re now entering a phase where search delivers complete, contextual experiences that integrate multiple sensory inputs and outputs.
Today’s multimodal AI systems offer an early preview of this fully realized Sensory Web:
- Point your phone at a plant and instantly receive species identification, care tips, and a video tutorial.
- Ask your smart glasses about a building and see AR overlays with historical information and upcoming events.
- Show a technical diagram to an AI assistant and get a real-time explanation of how the system works, with animated visuals highlighting each component.
While text-based SEO won’t disappear entirely, content strategies limited to text-only answers will increasingly find themselves buried beneath richer, more engaging multimodal experiences.
The future of search isn’t just about finding information; it’s about experiencing it. This sensory-first approach represents both the biggest challenge and the biggest opportunity for content creators in the years ahead. Those who adapt their strategies to embrace multimodal content creation will thrive, while those who remain text-only will gradually lose visibility in our increasingly rich and dynamic digital ecosystem.