
Pedestrian Detection on E-Scooters: Why It's Harder Than It Looks
Nearhuman Team
Near Human builds intelligent safety systems for micromobility — edge AI, computer vision, and human-centered design. Based in Bristol, UK.
At 15 mph, a scooter rider travelling toward a pedestrian who steps off a kerb two metres ahead has approximately 300 milliseconds before contact. Human reaction time is 150 to 250ms on a good day. The system warning the rider has to detect, classify, and alert in under 50ms just to leave enough time for a reaction. Now do that at dusk, in rain, with a camera mounted on a vibrating handlebar pointed at a cluttered street scene.
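The arithmetic above is worth making explicit. A minimal sketch, using the worst-case figures from the scenario (2 metres of gap, 250ms of human reaction time) rather than measured values:

```python
MPH_TO_MS = 0.44704  # miles per hour -> metres per second

def time_to_contact_ms(distance_m: float, speed_mph: float) -> float:
    """Time until the rider reaches a stationary pedestrian, in milliseconds."""
    return distance_m / (speed_mph * MPH_TO_MS) * 1000.0

contact = time_to_contact_ms(2.0, 15.0)  # ~298 ms until contact
system_budget = contact - 250.0          # ~48 ms left after worst-case human reaction
```

That 48ms remainder is the entire budget for capture, inference, and alerting, which is where the sub-50ms requirement comes from.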
The pitch for computer vision on scooters sounds clean: put a camera on the vehicle, run a detection model, warn the rider when someone is in the path. The reality is that pedestrian detection in low-speed urban environments is one of the harder computer vision problems in the field, not one of the easier ones. Autonomous vehicle researchers have spent billions on it at highway speeds with large sensor suites. The micromobility version has to solve a harder geometry problem with a fraction of the compute budget.
Why Urban Street Geometry Defeats Most Detection Models
Standard pedestrian detection models are trained and benchmarked on datasets like COCO and Cityscapes, which skew toward car-height camera positions, good lighting, and pedestrians occupying a meaningful portion of the frame. A handlebar-mounted camera on a scooter sits roughly 90 to 110 centimetres off the ground. At that height, a pedestrian five metres away fills around 15 percent of a typical field of view. At two metres, they fill 40 percent but are already dangerously close. The model has a very short window in which the subject is both visible enough to detect and far enough away to matter. Add partial occlusion from parked cars, bollards, or other riders, and false-negative rates in off-the-shelf models climb sharply. Some public benchmark models that perform at 92 percent accuracy on standard datasets drop to 71 to 78 percent in real urban scooter-height test conditions, based on academic evaluations of embedded vision systems in comparable environments.
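The size-versus-distance relationship falls out of simple pinhole geometry. A rough sketch, where the 1.7-metre pedestrian height and the 120-degree vertical field of view (typical of the wide-angle lenses used on action-style cameras) are illustrative assumptions, not figures from any specific product:

```python
import math

def fov_fraction(pedestrian_height_m: float, distance_m: float,
                 vertical_fov_deg: float) -> float:
    """Fraction of the vertical field of view a pedestrian subtends (pinhole model)."""
    subtended_rad = 2.0 * math.atan(pedestrian_height_m / (2.0 * distance_m))
    return math.degrees(subtended_rad) / vertical_fov_deg

near = fov_fraction(1.7, 2.0, 120.0)  # ~0.38 of frame height, but ~300 ms from contact
far = fov_fraction(1.7, 5.0, 120.0)   # ~0.16 of frame height at 5 m
```

The window between "large enough to detect reliably" and "too close to matter" is narrow, and occlusion eats into it further.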
The nuance here matters. A 78 percent detection rate sounds bad until you compare it to zero, which is what most scooters have today. The relevant question is not whether the system is perfect, it is whether it provides meaningful warning time in the scenarios that cause the most harm, specifically children, cyclists crossing perpendicularly, and pedestrians stepping from behind obstacles. Those cases are disproportionately represented in injury data, and they are also the cases where detection geometry is worst, because the pedestrian enters the camera's field of view at close range with little warning.
The Compute Trap: Why You Can't Just Run a Bigger Model
The instinct is to solve the accuracy problem with a more capable model. The constraint is that edge hardware does not allow it, at least not without accepting latency that defeats the purpose. A Hailo-8 neural processing unit can run a lightweight detection model at 30 frames per second with around 40ms end-to-end latency. Swap in a larger, more accurate model and that latency climbs toward 90 to 120ms, which sounds fine until you do the geometry at 15 mph and realise you have just given up more than half a metre of warning distance. NVIDIA Jetson Orin modules push the performance envelope further, but they draw three to five times more power, which is a serious constraint on a battery-powered vehicle with a 25 to 40 mile operational range. The engineering is a negotiation between accuracy, latency, and power draw, and none of the three can be optimised without hurting the others.
Lighting is the variable that gets least attention in product specifications and most attention in real-world failure analysis. Camera sensors on a moving vehicle deal with rapid exposure changes: a shaded alley opening onto a sunlit square, headlights at night, the flat grey of an overcast afternoon. Models trained on daytime data degrade measurably at dusk, where contrast drops and pedestrian silhouettes blur against complex backgrounds. Some teams are exploring thermal imaging as a complement to RGB cameras precisely because thermal sensors are lighting-invariant, detecting body heat rather than reflected light. The hardware cost is higher, around 80 to 150 pounds per unit at current volumes, but for operators in cities with significant night riding, the case is worth making.
The hardest part of pedestrian detection on a scooter isn't the AI. It's the 300 milliseconds you don't have.
Emergency rooms in Oregon and Florida are seeing more micromobility injuries, and school campuses are banning devices they cannot control. The detection problem will not be solved by a single breakthrough model. It will be solved by teams willing to instrument real vehicles, collect data at actual handlebar height, in actual weather, and iterate on the uncomfortable gap between benchmark accuracy and street-level performance. The researchers and engineers doing that work quietly, on production hardware, in ordinary cities, are the ones who will actually move the injury numbers. The benchmark leaderboards will not.
Frequently Asked Questions
What frame rate does an e-scooter camera need for reliable pedestrian detection?
At 15 mph, a minimum of 20 to 30 frames per second is generally required to avoid missing a pedestrian who enters the field of view between frames. At 25 mph, 30 fps becomes a firmer lower bound. Below 15 fps, the inter-frame gap at scooter speeds is large enough that a pedestrian stepping off a kerb can appear inside the danger zone with no prior detection frame.
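The inter-frame gap can be expressed as ground distance per frame, which makes the lower bounds above concrete:

```python
MPH_TO_MS = 0.44704  # miles per hour -> metres per second

def metres_per_frame(speed_mph: float, fps: float) -> float:
    """Ground distance covered between consecutive camera frames."""
    return speed_mph * MPH_TO_MS / fps

at_20fps = metres_per_frame(15.0, 20.0)  # ~0.34 m between frames
at_15fps = metres_per_frame(15.0, 15.0)  # ~0.45 m: a stride-length blind spot
at_25mph = metres_per_frame(25.0, 30.0)  # ~0.37 m even at 30 fps
```

A pedestrian's step off a kerb covers a comparable distance, which is why low frame rates can place someone inside the danger zone with no prior detection frame.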
Why is pedestrian detection harder on scooters than on cars?
Camera height is the primary factor. Car-mounted sensors typically sit 1.2 to 1.5 metres off the ground and benefit from well-curated training datasets at that height. Scooter cameras at 90 to 110 centimetres see a fundamentally different perspective, with more ground clutter, lower contrast against backgrounds, and smaller pedestrian bounding boxes at operationally relevant distances. Most off-the-shelf detection models are not fine-tuned for this geometry.
Can a scooter safety system work without sending video to the cloud?
Yes, and for both privacy and latency reasons it generally should. On-device inference runs the detection model locally on the vehicle's processor, generating structured event data rather than raw video. This means no personally identifiable imagery leaves the device, latency stays under 50ms, and the system works in areas with no mobile signal. The trade-off is that model updates require physical access or a secure over-the-air firmware process, rather than simply updating a cloud endpoint.
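As a sketch of what "structured event data rather than raw video" might look like in practice, here is a hypothetical event schema. The field names and format are illustrative assumptions, not a real product API:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class DetectionEvent:
    # Hypothetical schema -- field names are illustrative, not a production format.
    timestamp_ms: int   # device clock at time of detection
    object_class: str   # e.g. "pedestrian", "cyclist"
    confidence: float   # model confidence, 0.0 to 1.0
    distance_m: float   # estimated range from the camera
    bearing_deg: float  # angle off the direction of travel

def to_wire(event: DetectionEvent) -> str:
    """Serialise the event for telemetry; no image frames ever leave the device."""
    return json.dumps(asdict(event))

wire = to_wire(DetectionEvent(1713100000000, "pedestrian", 0.91, 3.2, -4.0))
```

A payload like this is a few hundred bytes rather than megabytes of video, which is what makes offline operation and privacy-preserving telemetry compatible.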
Sources & References
- COCO and Cityscapes dataset documentation and benchmark methodology
- Hailo-8 product specification sheet, Hailo Technologies
- NVIDIA Jetson Orin power consumption and inference benchmarks
- Academic literature on embedded pedestrian detection at non-standard camera heights
- Oregon Health Authority and Florida legislative safety reporting, 2025
- Thermal imaging for pedestrian detection: Photonics Spectra research context
- Human reaction time research: 150-250ms range from transport safety literature
14 Apr 2026