Comma Research · Part 3 · 6 min read

Turning Test Drives Into a Research Loop

How bookmarks, redacted logs, route catalogs, and small comparisons turned subjective drives into usable evidence.

analysis harness · route catalog · metrics · AI-assisted work
May 8, 2026 · Comma Frontier Notes

The project changed when the assistant became part of the loop. After a drive, I could describe what I felt: a jerky correction, a late stop, a creep event, a warning on the dash, a stretch that felt surprisingly good. The assistant could then help turn that into a concrete analysis pass: find the window, compare settings, inspect alerts, and update the route catalog.
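
To make that handoff concrete, here is roughly the shape it took, as a minimal Python sketch. DriveNote, AnalysisTask, and task_from_note are names I am inventing for illustration; the real flow was messier and partly conversational:

```python
from dataclasses import dataclass, field

@dataclass
class DriveNote:
    """One subjective observation, captured right after the drive."""
    route_id: str        # private identifier, never published
    t_event_s: float     # rough time of the event, seconds from drive start
    feeling: str         # e.g. "jerky correction", "late stop", "creep event"

@dataclass
class AnalysisTask:
    """What the assistant turns a note into: a bounded, checkable question."""
    route_id: str
    window_s: tuple[float, float]                 # log window around the event
    checks: list[str] = field(default_factory=list)

def task_from_note(note: DriveNote, margin_s: float = 15.0) -> AnalysisTask:
    """Expand a vague feeling into a concrete window plus things to inspect."""
    return AnalysisTask(
        route_id=note.route_id,
        window_s=(max(0.0, note.t_event_s - margin_s), note.t_event_s + margin_s),
        checks=["alerts in window", "settings diff vs reference", "catalog update"],
    )
```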

The rule was simple: private data stays private. Raw logs, identifiers, and generated analysis artifacts stay out of public posts. The public version can talk about patterns, methods, and redacted results; the private work stays precise enough to be useful.
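
The redaction step was mechanical on purpose. A minimal sketch, assuming log records are plain dicts; the key set here is illustrative, not the real audited list:

```python
import copy

# Illustrative key set; the real list came from auditing actual log schemas.
PRIVATE_KEYS = {"dongle_id", "route_name", "vin", "git_remote", "lat", "lon"}

def redact(record: dict) -> dict:
    """Copy a log record with private fields replaced rather than dropped,
    so public and private artifacts keep the same shape."""
    out = copy.deepcopy(record)
    for key in PRIVATE_KEYS & record.keys():
        out[key] = "[REDACTED]"
    return out
```

Replacing values instead of deleting keys meant a public excerpt could still be lined up field-for-field against the private original.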

Metrics that made the drives less slippery

The analysis harness tracked signals such as lateral error, steering saturation, output-rate changes, driver steering time, no-lead speed deficits, lazy acceleration time, stop events, bookmarks, alerts, and warning windows. None of those numbers fully captures trust. Together they make it harder to fool yourself.
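
Most of those metrics reduce to simple aggregations over a log window. A sketch of the idea, assuming per-frame dicts with hypothetical field names; the real harness read typed log messages rather than dicts:

```python
def summarize_window(frames: list[dict]) -> dict:
    """Aggregate per-frame signals into window-level metrics.

    Assumed per-frame fields (hypothetical names):
      steering_pressed: bool   driver overriding or assisting steering
      sat_flag: bool           steering command pinned at its limit
      lat_error_m: float       lateral offset from the planned path, meters
    """
    if not frames:
        return {}
    n = len(frames)
    lat = sorted(abs(f["lat_error_m"]) for f in frames)
    return {
        "driver_steering_frac": sum(f["steering_pressed"] for f in frames) / n,
        "saturation_frac": sum(f["sat_flag"] for f in frames) / n,
        "lat_error_p95_m": lat[int(0.95 * (n - 1))],
    }
```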

A route that feels smooth should leave traces: less steering correction, fewer saturated commands, cleaner lateral tracking, or fewer moments where the driver has to rescue the system. A route that feels scary should also leave traces: hard braking, disengagements, temporary steering faults, or bookmarks clustered around awkward behavior.
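
Bookmarks were the easiest trace to cluster: a burst of them within a short span usually marks one awkward stretch, not many separate problems. A minimal grouping sketch, with the gap threshold as an assumption:

```python
def bookmark_clusters(times_s: list[float], max_gap_s: float = 30.0) -> list[tuple[float, float]]:
    """Group bookmark timestamps into (start, end) clusters.

    Bookmarks closer than max_gap_s apart are treated as one awkward
    stretch; the threshold is a guess, tuned by eye against real drives.
    """
    if not times_s:
        return []
    times = sorted(times_s)
    clusters, start, prev = [], times[0], times[0]
    for t in times[1:]:
        if t - prev > max_gap_s:
            clusters.append((start, prev))
            start = t
        prev = t
    clusters.append((start, prev))
    return clusters
```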

North Nevada v2 became the reference point

Across the early comparisons, North Nevada v2 stood out subjectively as the smoothest model for this setup. That did not solve everything. Stop-and-creep behavior still needed work, and steering smoothness still had rough edges. But a good reference point matters because it keeps the experiment from drifting across too many variables at once.
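
In practice, "reference point" meant diffing every candidate's window metrics against North Nevada v2's metrics on the same route. A sketch under the assumption that all the metrics above are lower-is-better; the function name is mine:

```python
def compare_to_reference(candidate: dict, reference: dict) -> dict:
    """Per-metric deltas against the reference run on the same route.

    Positive delta means the candidate did worse, assuming every metric
    is lower-is-better (as the window metrics above are).
    """
    shared = candidate.keys() & reference.keys()
    return {k: candidate[k] - reference[k] for k in sorted(shared)}
```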

The loop became: drive, annotate, ingest, compare, patch, audit, repeat.
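
The audit step ended in the route catalog, which was just a merged record per route. A minimal sketch of that upsert, with a hypothetical file layout standing in for the private one:

```python
import json
from pathlib import Path

CATALOG = Path("routes/catalog.json")   # hypothetical layout, private repo only

def upsert_route(route_id: str, entry: dict) -> None:
    """Merge a route's latest metrics and notes into the catalog, so the
    next comparison starts from everything already known about it."""
    catalog = json.loads(CATALOG.read_text()) if CATALOG.exists() else {}
    catalog.setdefault(route_id, {}).update(entry)
    CATALOG.parent.mkdir(parents=True, exist_ok=True)
    # sort_keys keeps diffs stable when the catalog lives in version control
    CATALOG.write_text(json.dumps(catalog, indent=2, sort_keys=True))
```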

That loop also made failures less annoying. A bad route still helped when it produced a cleaner hypothesis. A weird alert became more useful once it could be traced to a temporary steering fault and a specific command handoff pattern. Evidence turns frustration into a next step.