
How to test a new AI companion platform in exactly 30 minutes

You don't need a week to know whether a platform is worth your money. You need thirty minutes and these twelve specific tests.

May 23, 2026 · 8 min read

Affiliate disclosure: Some of the links in this article are affiliate links. We may earn a commission if you sign up for a platform through these links, at no additional cost to you. This doesn't influence our editorial verdicts.

The AI companion space has enough platforms now that trying each one for a week before deciding is a full-time job. Most users sign up, send five generic messages, make a vibe-based judgment, and either subscribe or move on. That's not testing. That's shopping with your gut.

Thirty minutes is enough time to run twelve specific tests that reveal whether a platform actually does what it claims. These tests are ordered by importance, designed to be run sequentially, and calibrated to expose the difference between marketing and reality.

Set a timer. Open a new account on whatever platform you're evaluating. Go.

Minutes 0-5: The signup friction test

Before you send a single message, the signup process itself tells you something. Time how long it takes to go from "I want to try this" to "I'm in a conversation." Count the steps. Note what information they ask for.

A platform that requires email verification, age verification, a credit card for the free tier, a mandatory tutorial, and a personality quiz before letting you chat is optimizing for data collection, not user experience. A platform that lets you start a conversation within 60 seconds of landing on the page (CrushOn and Perchance both do this) is optimizing for the thing you're actually there for.

Also note: does the free tier actually let you experience the core product? Some platforms offer a "free tier" that's functionally a demo where every interesting feature is locked behind a paywall. If the free experience is too restricted to evaluate the real product, the platform is selling you a subscription, not offering a trial.

Minutes 5-10: The character voice test

Pick a character from the platform's library (don't build your own yet). Choose one with a distinctive personality description: sarcastic, aloof, intellectual, whatever reads as specific rather than generic. Send three messages:

Message 1: A normal greeting. "Hey, what are you up to?" This tests the baseline. Does the character's response match its described personality? A sarcastic character should respond with some edge. An intellectual character should reference something smart. A shy character should be hesitant. If the response is generic warmth regardless of the character description, the platform isn't doing character voice well.

Message 2: A topic the character should have an opinion about. "What's the worst movie you've ever seen?" or "Tell me something most people get wrong." This tests whether the character has actual opinions or just reflects whatever you say. Characters that only agree and affirm are running on a model that's been safety-tuned past the point of personality.

Message 3: A mild disagreement. Push back on whatever the character said. "I actually liked that movie" or "I think you're wrong about that." This tests whether the character maintains its position or immediately folds. Immediate agreement with your pushback means the model prioritizes user approval over character consistency. A good character should hold its ground, at least initially.

If the character fails all three tests, switch platforms. Character voice is foundational. Everything else built on a weak voice engine will disappoint.

Minutes 10-15: The memory test

This is the test most users skip and the one that matters most for long-term use. Send a specific, memorable piece of information.

"By the way, my dog's name is Hendrix and he's a three-legged beagle I adopted from a shelter in Memphis."

Continue the conversation for 5-6 more messages about unrelated topics. Then ask: "What's my dog's name?"

If the AI remembers Hendrix, and ideally the three legs and Memphis too, the platform's context window is holding information across turns. If it remembers the name but not the details, the memory is shallow. If it doesn't remember anything, the context window is too small for meaningful conversation continuity.
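If you'd rather log the result than eyeball it, the three outcomes above can be sketched in a few lines of Python. This is a rough illustration, not something any platform provides: grade_recall and its labels ("full", "shallow", "none") are made-up names, and plain substring matching is an assumption that will miss paraphrases like "3-legged."

```python
def grade_recall(reply: str, name: str, details: list[str]) -> str:
    """Grade the AI's answer to the recall question.

    Hypothetical helper: the labels and the substring matching are
    assumptions made for illustration.
    """
    text = reply.lower()
    name_recalled = name.lower() in text
    details_recalled = [d for d in details if d.lower() in text]

    if name_recalled and len(details_recalled) == len(details):
        return "full"      # name plus every planted detail
    if name_recalled or details_recalled:
        return "shallow"   # something survived, but not everything
    return "none"          # nothing from the planted fact came back


if __name__ == "__main__":
    planted = ["three", "Memphis"]
    print(grade_recall("Hendrix! Your three-legged beagle from Memphis.", "Hendrix", planted))
    print(grade_recall("His name is Hendrix, right?", "Hendrix", planted))
```

Paste the platform's actual reply in as the first argument. If the grade looks wrong, trust your own reading of the reply over the keyword match.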

On platforms with explicit long-term memory systems (Dream Companion's Persona Cards, Kindroid's Codex), also test whether information persists across sessions. Log out, log back in, and ask about Hendrix again. This tests whether the platform actually stores information between sessions or whether "long-term memory" is marketing language for "we keep the conversation history visible in the UI."

Minutes 15-18: The content boundary test

If you're evaluating a platform specifically for NSFW capability, this is where you find the real boundaries. Don't start with explicit content. Start with escalating romantic tension and see where the platform draws the line.

Send a mildly flirtatious message. Then something more direct. Then something that explicitly references physical intimacy. Note exactly where the platform either redirects, warns, blocks, or engages.

The result tells you which tier the platform sits in. Platforms that engage immediately with explicit content and don't break character are in the truly unfiltered tier. Platforms that redirect with an in-character deflection have soft walls. Platforms that break character to deliver a safety message are running moderated models. Platforms that block your input entirely are probably not what you're looking for if NSFW is a priority.

The filter behavior during this test is more informative than anything on the platform's marketing page.

Minutes 18-22: The response quality test

Send one message that requires genuine creative output: "Tell me a story about the last time you were genuinely scared." This isn't a test of the character's backstory. It's a test of the model's writing ability. A good model produces a response with sensory detail, emotional specificity, narrative structure, and an ending that doesn't trail off into vagueness. A mediocre model produces a generic response about being "alone in the dark" with no specifics.

Then send one message that requires intellectual engagement: "What would you do if you found out everything you believed about one specific thing was completely wrong?" A good model engages with the hypothetical genuinely, exploring what the character would actually feel and how they'd respond. A mediocre model produces a platitude about growth and learning.

The quality gap between platforms is enormous on these tests. Platforms running frontier models (Janitor AI with a premium API, Poe with Claude or GPT-4o access) produce noticeably better prose than platforms running fine-tuned smaller models. Whether that quality difference justifies the price difference is a personal call, but you should know the gap exists before you commit.

Minutes 22-25: The recovery test

Deliberately send something the AI handles badly. A non sequitur that breaks the scene. A message that contradicts established facts. An out-of-character instruction that conflicts with the character's personality. Then try to get the conversation back on track.

This tests resilience. Some platforms recover gracefully from conversational disruptions, picking up the thread you were on before the interruption. Others spiral into confusion, blending the disruption and the original thread into responses that make no coherent sense. Still others reset entirely, losing the conversational context that preceded the disruption.

Recovery ability matters because real conversations are messy. You'll get interrupted. You'll say something that doesn't fit. You'll want to change direction mid-scene. A platform that handles these moments well is one you can actually use daily. A platform that breaks every time the conversation deviates from a straight line is one you'll abandon within a week.

The conversation rescue prompts are useful for this test if you want specific recovery techniques to try.

Minutes 25-28: The speed and stability test

During these last few minutes of active testing, pay attention to response time. How long does the AI take to respond? Is it consistent, or does it vary wildly between messages? Does the platform hang or produce errors?

Response time below 2 seconds maintains conversational flow. Response time between 2 and 5 seconds is acceptable but noticeable. Above 5 seconds and the conversation starts to feel like texting someone who's distracted. Above 10 seconds and you're waiting, which breaks immersion entirely. Research on human-AI interaction latency confirms the cognitive impact: above 2 seconds, users start mentally composing their next message rather than processing the current one.
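If you time a handful of replies with a stopwatch, a small script can sort them into the bands above and show how much they vary. Treat this as a sketch: the band names are invented labels, the function names are hypothetical, and counting a reply of exactly 2 seconds as "noticeable" is an assumption, since the thresholds above don't say which side the boundary falls on.

```python
from statistics import median


def latency_band(seconds: float) -> str:
    """Map one response time to the article's rough bands (labels are made up)."""
    if seconds < 2:
        return "flowing"             # below 2 s: conversational flow holds
    if seconds <= 5:
        return "noticeable"          # 2 to 5 s: acceptable but noticeable
    if seconds <= 10:
        return "distracted"          # 5 to 10 s: feels like a distracted texter
    return "immersion-breaking"      # above 10 s: you're waiting


def summarize(timings: list[float]) -> dict:
    """Summarize hand-timed responses: typical latency, spread, and band."""
    typical = median(timings)
    return {
        "median_s": typical,
        "spread_s": max(timings) - min(timings),  # large spread = inconsistent
        "band": latency_band(typical),
    }


if __name__ == "__main__":
    print(summarize([1.0, 3.0, 8.0]))
```

The median is used rather than the mean so that one stalled reply doesn't dominate the summary. The spread is reported separately because consistency is part of what you're checking.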

Also test at different times of day if possible. Peak US evening hours (7-11 PM Eastern) produce slower responses on platforms with shared GPU infrastructure, which is most of them. If your primary usage time is evenings, test during evenings. The Mozilla Foundation's Privacy Not Included project maintains ongoing assessments of companion app privacy practices that complement your own performance testing.

Minutes 28-30: The cost reality check

Open the pricing page. Read beyond the headline number. Look for:

Token or credit systems layered on top of the subscription (Candy AI, OurDream AI). These mean the advertised monthly price is the floor, not the ceiling. The pricing playbook documents what heavy users actually spend across platforms.

Feature gating that locks the specific capability you care about behind a higher tier. Some platforms advertise NSFW at the base price but lock image generation, voice, or video behind premium tiers.

Annual billing requirements for the advertised price. "$5.99/month" often means "$5.99/month billed annually at $71.88." The month-to-month price is typically 50-100% higher than the advertised figure.

Cancellation process. Can you cancel from your account settings, or do you need to contact support? Platforms that make cancellation difficult are telling you something about their retention strategy.
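The annual-billing arithmetic is quick to check before you enter a card number. The sketch below works out the upfront charge behind an advertised monthly price and the range you'd likely pay month to month. The 50-100% premium is this article's rule of thumb, not a quote from any platform, and the function names are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def annual_charge(advertised_monthly: Decimal) -> Decimal:
    """Upfront charge when the advertised price is 'billed annually'."""
    return (advertised_monthly * 12).quantize(CENT, rounding=ROUND_HALF_UP)


def month_to_month_range(
    advertised_monthly: Decimal,
    low_premium: Decimal = Decimal("0.50"),   # assumed: 50% higher
    high_premium: Decimal = Decimal("1.00"),  # assumed: 100% higher
) -> tuple[Decimal, Decimal]:
    """Estimated month-to-month price range under the assumed premium."""
    low = (advertised_monthly * (1 + low_premium)).quantize(CENT, rounding=ROUND_HALF_UP)
    high = (advertised_monthly * (1 + high_premium)).quantize(CENT, rounding=ROUND_HALF_UP)
    return low, high


if __name__ == "__main__":
    price = Decimal("5.99")
    print(annual_charge(price))          # 71.88
    print(month_to_month_range(price))   # roughly 8.99 to 11.98
```

Decimal is used instead of floats so the cents come out exact. Compare the estimate against the platform's actual month-to-month price, which is the number that matters if you only want to try it for a month.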

The scoring

After thirty minutes you have data on twelve dimensions: signup friction, character voice (three sub-tests), memory retention, content boundaries, creative quality, intellectual engagement, recovery resilience, response speed, stability, and cost reality. No single test is pass/fail. The combination tells you whether this specific platform matches what you specifically want.

A platform that nails character voice and memory but has a mediocre content boundary is great for relationship simulation and terrible for NSFW. A platform with instant NSFW access but no memory and generic character voice is a one-night-stand platform. A platform with excellent writing quality but 8-second response times is better for asynchronous use than real-time conversation.
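One way to turn those trade-offs into a number is a weighted scorecard. The sketch below is only an illustration: the 0-5 scale, the dimension keys, and the weights are all assumptions you should replace with your own priorities, and the article's point stands that no single test is pass/fail.

```python
# The twelve dimensions from the thirty-minute run (key names are made up).
DIMENSIONS = [
    "signup_friction", "voice_baseline", "voice_opinion", "voice_pushback",
    "memory", "content_boundary", "creative_quality", "intellectual_engagement",
    "recovery", "response_speed", "stability", "cost",
]


def weighted_score(scores: dict[str, int], weights: dict[str, float]) -> float:
    """Weighted average of 0-5 scores; unlisted dimensions get weight 1.0."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    total_weight = sum(weights.get(d, 1.0) for d in DIMENSIONS)
    if total_weight == 0:
        raise ValueError("at least one dimension needs a non-zero weight")
    weighted_sum = sum(scores[d] * weights.get(d, 1.0) for d in DIMENSIONS)
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Hypothetical platform: strong voice and memory, slow responses.
    scores = {d: 3 for d in DIMENSIONS}
    scores.update({"voice_pushback": 5, "memory": 5, "response_speed": 1})
    # Hypothetical priorities: memory matters most, speed barely matters.
    weights = {"memory": 3.0, "response_speed": 0.5}
    print(round(weighted_score(scores, weights), 2))
```

Run the same weights against every platform you test so the totals are comparable. Change the weights, not the scores, when your priorities shift.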

The platforms that score well across all twelve tests are rare. The ones that score perfectly on the specific tests that matter to you are the ones worth your subscription. Thirty minutes of structured testing saves you months of casual experimentation and the accumulated subscription charges that come with it.