Identifying and interacting with smart appliances has been challenging in the burgeoning smart building era. Existing identification methods require either cumbersome query statements or the deployment of additional infrastructure, and no platform abstracts sophisticated computer vision technologies into an easy visual identification interface, which is the most intuitive modality for humans. We introduce CellMate, a responsive and accurate vision-based appliance identification system that uses smartphone cameras. We optimize and combine several state-of-the-art computer vision techniques to meet our unique constraints of accuracy, latency, and scalability. To evaluate CellMate, we collected 4,008 images from 39 room-size areas across five campus buildings, a dataset one order of magnitude larger than prior work. We also collected 1,526 human-labeled query images and tested them on different groups of areas. Using existing indoor localization technologies, we can easily narrow the candidate location down to ten areas, where CellMate achieves a success rate of more than 96% with server processing time under 60 ms. We optimized average local network latency to 84 ms and therefore expect a total identification time of around 144 ms on the smartphone.