“MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications” was submitted to arXiv in April 2017 by Andrew Howard and colleagues at Google. Its goal was not to top the accuracy charts but to make convolutional networks small and fast enough to run on phones and other resource-constrained devices.
The key technique is the depthwise separable convolution, which splits an ordinary convolution into two cheaper steps: a depthwise convolution that filters each input channel independently, followed by a 1×1 pointwise convolution that combines the channels. This factorization approximates a standard convolution at a small fraction of the computation and parameters: for the 3×3 kernels MobileNets uses, the paper reports 8 to 9 times less computation with only a small loss in accuracy. On top of this, the authors exposed two simple knobs, a width multiplier that thins every layer of the network and a resolution multiplier that shrinks the input image, letting a developer trade some accuracy for speed and size to fit a specific device.
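The arithmetic behind that saving can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function names and the example layer size (512 channels in and out on a 14×14 feature map) are assumptions chosen for the example, and the counts use the usual multiply-add formulas for a stride-1 layer whose output is the same size as its input.

```python
def standard_conv_mult_adds(kernel, in_ch, out_ch, feat):
    """Multiply-adds for a standard convolution on a feat x feat feature map."""
    return kernel * kernel * in_ch * out_ch * feat * feat


def separable_conv_mult_adds(kernel, in_ch, out_ch, feat, alpha=1.0, rho=1.0):
    """Multiply-adds for a depthwise separable convolution.

    alpha is the width multiplier (scales channel counts) and rho is the
    resolution multiplier (scales the feature map side). Both default to 1.0.
    """
    m = int(round(alpha * in_ch))   # thinned input channels
    n = int(round(alpha * out_ch))  # thinned output channels
    f = int(round(rho * feat))      # shrunken feature map side
    depthwise = kernel * kernel * m * f * f  # one k x k filter per input channel
    pointwise = m * n * f * f                # 1x1 convolution mixing the channels
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 512 -> 512 channels, 14x14 feature map.
full = standard_conv_mult_adds(3, 512, 512, 14)
separable = separable_conv_mult_adds(3, 512, 512, 14)
thin = separable_conv_mult_adds(3, 512, 512, 14, alpha=0.5)
small = separable_conv_mult_adds(3, 512, 512, 14, rho=0.5)

print("separable / standard:", separable / full)
print("alpha=0.5 / separable:", thin / separable)
print("rho=0.5 / separable:", small / separable)
```

For this layer the separable version needs between one eighth and one ninth of the multiply-adds of the standard convolution, because the ratio works out to 1/512 + 1/9 and the kernel term dominates. Halving either multiplier cuts the remaining cost to roughly a quarter, since both knobs act approximately quadratically on computation.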
MobileNets became foundational for on-device computer vision, spawning follow-up versions such as MobileNetV2 and MobileNetV3 and shaping how the field thinks about deploying models at the edge rather than only in the cloud. The depthwise separable convolution it popularized also reappears in later efficient architectures such as EfficientNet.
For a business reader, MobileNets is the architecture that helped move AI off the server and into the camera, the app, and the sensor, where running locally means lower latency, lower cost, and better privacy.