
When will we get object and image classification (Computer Vision) for Quest 3 and Quest Pro?

jeremy.deats
Protege

If I wanted to build a Mixed Reality app that can detect when a certain brand logo is visible on a poster, coffee cup, coaster, etc., and then allow spatial anchoring relative to that logo, there seems to be no way to achieve this today. Computer vision for Quest 3 and Quest Pro developers is limited to a very restricted list of "semantic classification" labels, all of them room-architecture and furniture objects (ceiling, floor, wall, door fixture, lamp, desk, etc.). Full list here: https://developer.oculus.com/documentation/unity/unity-scene-supported-semantic-labels/

This also prohibits any kind of AR/MR training experience where some physical-world object (e.g. a bulldozer operator panel) could be detected and spatial anchors placed relative to specific control-panel features to provide dialogs and other overlays, all the things you'd expect from industrial AR applications. But this is not just useful for enterprise/industrial AR: image and object classification is a core AR/MR feature required to build compelling experiences. Without it, we just have novelty use cases.

Looking at the competition, ByteDance is solving this by simply allowing camera feed access on the enterprise version of the Pico 4; on the retail version they block it. I doubt Meta will provide camera feed access, as they are no longer selling enterprise-specific hardware and this would require a special firmware update to enable.

Apple has provided camera access to iOS developers through ARKit for years. For Vision Pro's ARKit implementation they are restricting camera feed access, but they are still providing image classification/detection via their computer vision models, allowing developers to add their own images for recognition. Here's a page from their docs:

https://developer.apple.com/documentation/visionos/tracking-images-in-3d-space

I am really surprised Quest Pro has been out almost a year and this sort of core AR/MR functionality is completely absent. With Quest 3 now released, more attention will be on AR/MR experiences, and Meta has great in-house AI technology, including computer vision models. They could build a closed pipeline where the raw image feed is not accessible but the classifier model is compiled in, so that through a closed system the detection can happen in Unity3D or Unreal apps.

Regardless of how they achieve it, this is vitally important to future MR/AR apps. Without it, basically all you can do is very simple spatial anchoring, which may be suitable for novelty games but is very restrictive and not reflective of the power of MR/AR.


spacefrog
Honored Guest

Thanks Jeremy for such an elaborate post about that topic.
I figure Meta is just slowly shifting their purely consumer-focused VR/XR approach over to a more business- and technology-oriented mindset. Or at least I hope that is happening.

Consumer VR looks like it's staying in its niche for a while longer, I think.

In every way the Quest 2 and Quest 3 are marketed as game-console products, with Quest 3 even including a voucher for an upcoming game. I get that. But you can't do anything meaningful in AR/MR without computer vision. Quest 3 has computer vision, just restricted to common architecture and furnishings.

Here's an example of a consumer application that the Quest 3 hardware is capable of, but that cannot be built due to the limitations on computer vision:

Imagine an MR/AR experience where a standard deck of playing cards can be used and the headset uses image detection to determine what's in the player's hand (King of Hearts, Seven of Clubs, Ace of Spades, etc.). The experience then teaches the user how to play various card games, spatially anchoring dialogs to specific cards to provide information and clues.

What's described above can only be achieved if developers are given access to computer vision models. The company Vuforia provides an SDK and end-to-end solution where they host the model; as a developer you can upload images to train it for your experience. I could build the experience described above using the Vuforia SDK and services and make it multi-platform, working on HoloLens, Magic Leap, or on an iPhone or Android phone. But due to the API restrictions, AR SDK vendors like Vuforia, and even open tools built on OpenCV, will not work with Quest Pro or Quest 3.

With native ARKit development I could build this for Apple's Vision Pro headset. I don't think a lot of people realize it yet, but this means the Vision Pro is going to be capable of delivering a broad array of AR-type experiences through MR that Quest 3 is incapable of, all due to this restriction and the lack of forethought about a CV pipeline. It's kind of asinine when you think about it.

Without this there is no path to building true MR/AR experiences on Quest 3. To me, this is the difference between the device being a toy and being taken seriously as a tool. I really hope they provide developers a path to at least perform image recognition.

@spacefrog wrote:

Thanks Jeremy for such an elaborate post about that topic.
I figure Meta is just slowly shifting their purely consumer-focused VR/XR approach over to a more business- and technology-oriented mindset. Or at least I hope that is happening.

Consumer VR looks like it's staying in its niche for a while longer, I think.
Meta has a big push for AI technologies, which is great. But computer vision of the kind I'm describing above is the essential AI technology for AR/MR, and it is entirely off limits to developers.

Ivan_aa
Explorer

I agree, having access to the raw camera feed is very important for industrial and other non-typical-consumer applications. If this could be accessed on a Quest 3 or Pro, even if we had to pay more for an enterprise version like with the Pico, it would open up the device to many more use cases. Robotics, education, medical, and many other fields would benefit greatly from image processing and object recognition.

Because of the wider field of view (though the warping on the passthrough/video see-through could be improved), I could see this easily replacing development on the HoloLens or Magic Leap if the raw video feed, and possibly point cloud data from the depth camera, were accessible.

I can only assume this isn't currently available to developers because of privacy issues for regular consumers, but there should be a way to allow it for enterprise use. Hopefully this will come in the future; otherwise it is a missed opportunity.

Camera feed access is unavailable for privacy reasons, and Apple is also disabling raw camera feed access on the Vision Pro. But Apple does offer a pipeline where, through ARKit, developers can add images (and I believe objects as well) for recognition. Apple hosts an instance of the computer vision model, and developers can train that model and gain the ability to recognize images/objects and set spatial anchors on them in real time, all without access to the raw camera feed on Vision Pro. Apple does this without a fee. Apple developers do have to pay an annual fee of around $100 to be part of the Apple Developer program (which covers all Apple devices), but there is nothing additional and no runtime expense involved in using ARKit.

See:

https://developer.apple.com/documentation/visionos/tracking-images-in-3d-space

https://developer.apple.com/documentation/arkit/imageanchor

All Meta needs to do is find their own way of providing a counterpart to what Apple offers through that API, to open the door to a large range of MR/AR experiences. The Quest 3 hardware is capable; it is just limited by software in this case. Developers need to be able, at minimum, to detect images in the environment and place spatial anchors on them.

Meta has invested heavily in AI technologies, including computer vision. They have all the technology needed to build a closed pipeline without exposing the raw camera feed. Without one, you can only build a very limited range of AR/MR experiences.

It would be a shame if Meta made this closed pipeline exclusive to some enterprise package; that would close the door on consumer apps that could benefit from it. I also think it would be a mistake for them to try to monetize the pipeline. If they do go that route, I hope they provide a generous free tier so developers (including App Lab developers) can experiment and release great experiences.

It makes no sense for Apple's Vision Pro developers to be able to build an entire range of apps that can't also be ported to Quest 3. Until Meta builds this closed CV pipeline or allows camera feed access, the Quest 3's MR/AR use cases are limited to Meta's in-house apps and to games that can only attach spatial anchors to, and augment over, basic room geometry and fixtures. It's one of the key differences between the device being a toy/console and a spatial computer.


monsterbai
Explorer

Hi Jeremy,

I wonder if the Vision Pro can achieve object detection, for example, detecting a basketball.

Thank you!

ARKit can on iOS, but from the documentation it appears that on visionOS Apple has only enabled developers to train the computer vision model on images.

Image detection means it can only recognize 2D pictures but cannot recognize 3D objects, right? And will Apple open up access to object detection in the near future? I truly agree with your points about the limitations without these algorithms.

You have to realize how computer vision models are trained. There are different approaches to "object recognition" beyond just identifying an image, but the most common approach doesn't actually involve feeding depth-sensor/geometry data to the model. Instead it works something like this: say the object you want to identify is a coffee mug. A deep learning framework like TensorFlow, Keras, or PyTorch might train on 2,000 images of coffee mugs taken from different perspectives, and its classifier will then do a reasonable job of identifying a coffee mug.
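A minimal sketch of the kind of training loop described above, using PyTorch. The `MugClassifier` name, layer sizes, and data are all illustrative; real training would use a labelled dataset of photos rather than the stand-in random tensors here:

```python
import torch
import torch.nn as nn

# Tiny CNN classifier for the "coffee mug vs. not a mug" example.
# Architecture is illustrative, not tuned for real use.
class MugClassifier(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # collapse to one 32-dim vector per image
        )
        self.head = nn.Linear(32, num_classes)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

model = MugClassifier(num_classes=2)          # "mug" vs. "not mug"
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Stand-in for a real dataset of ~2,000 labelled photos taken
# from many perspectives: a batch of 8 random 64x64 RGB images.
images = torch.randn(8, 3, 64, 64)
labels = torch.randint(0, 2, (8,))

for _ in range(3):                            # a few training steps
    optimizer.zero_grad()
    logits = model(images)
    loss = loss_fn(logits, labels)
    loss.backward()
    optimizer.step()

probs = torch.softmax(model(images), dim=1)   # per-class confidence
```

The point is that the classifier learns from many 2D views of the object; no depth or geometry data is involved, which is why this style of recognition fundamentally wants a camera feed as input.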

To do computer vision at this depth you really need access to the live camera feed from the device, which neither Apple nor Meta give you.

What Apple is offering is a simplified version of image recognition, which is still super useful for many AR scenarios. The developer has to supply all the images to be recognized in the assets folder, and they become part of the build; it's really set up for the developer to supply one image for each item to be recognized. Then, programmatically, you can have Swift augment (place a 3D rendering) at some point relative to the X,Y offset from the top-left corner of the image... actually a Cartesian coordinate system might be used, I'm not sure, but you can easily render, say, a 3D model of text showing the price of a baseball card.
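As a hypothetical illustration of the anchoring math involved (this is not the ARKit API, just the underlying geometry): once a framework hands you the world transform of a detected image, an offset expressed in the image's own frame maps to a world position with a homogeneous matrix multiply. The function and matrix below are made up for the example:

```python
def anchor_world_position(image_transform, local_offset):
    """image_transform: 4x4 row-major matrix placing the detected image
    in world space; local_offset: (x, y, z) metres relative to the
    image's own origin (e.g. its top-left corner), in the image's frame.
    Returns the world-space (x, y, z) where content should be anchored."""
    x, y, z = local_offset
    col = (x, y, z, 1.0)  # homogeneous coordinates
    return tuple(
        sum(image_transform[r][c] * col[c] for c in range(4))
        for r in range(3)
    )

# A detected image sitting 2 m in front of the viewer, unrotated
# (negative Z is "forward" in this made-up right-handed convention):
IDENTITY_AT_Z2 = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, -2.0],
    [0.0, 0.0, 0.0, 1.0],
]

# Place a price label 10 cm right of and 5 cm above the image origin:
pos = anchor_world_position(IDENTITY_AT_2 if False else IDENTITY_AT_Z2,
                            (0.10, 0.05, 0.0))
# pos == (0.10, 0.05, -2.0)
```

With the raw transform handled by the platform, this is all an app needs in order to pin dialogs or labels to a recognized image, which is exactly the capability the thread is asking Meta to expose.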

Meta doesn't even give us that, and the sad thing is Meta is ahead of Apple on the AI front; this should be deeply integrated.

monsterbai
Explorer

Thank you very much for your explanation! That's really helpful.