What Is the Status of Apple's AI Ecosystem?

If there's one truly hot topic in 2024, it's AI, with OpenAI's ChatGPT and Sora, Microsoft's Copilot, and Google's Gemini leading the charge. Generative AI tools based on Large Language Models (LLMs) are popping up everywhere, and the pace of iteration is so fast it's hard to keep up.

For Apple, however, the pace of catching up with LLMs doesn't seem nearly as fast. Compared to ChatGPT, Siri's performance today is underwhelming: it has almost no contextual understanding, often runs into connectivity issues, and its voice recognition isn't always accurate. With rumors flying in recent days that ChatGPT will be integrated into iOS 18, it looks as though Apple is already lagging far behind in the AI space.

Apple's exploration of AI has in fact been continuous since the release of Siri in 2011, and a considerable number of AI-related features are already integrated into its systems. But in 2024, with generative AI this hot, Apple's showing has not been outstanding. At its last two launch events Apple mentioned AI as often as it could, whereas a year ago it barely mentioned AI at all.

Today's post takes a look at the AI-related hardware, software, and system features mentioned at Apple's recent events, and serves as an appetizer for the "Absolutely Incredible" WWDC 2024 a month from now.

The Apple-designed Neural Engine accelerates specific machine learning models more efficiently than the CPU or GPU, and it ships in a wide range of devices, including iPhone, iPad, MacBook, and even Apple Watch. Many of the AI features on Apple's platforms, such as on-device Siri, dictation, auto-correction, Animoji, and computational photography, rely on it to accelerate inference locally and in real time without affecting the overall responsiveness of the system. With Core ML, third-party developers can also use the on-device Neural Engine to accelerate machine learning computation; the App Store already has a number of text-to-image apps that run entirely locally.
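As a rough illustration of how a third-party app taps into this hardware, here is a minimal Core ML sketch that loads a compiled model and asks the runtime to prefer the Neural Engine. The model name `StyleTransfer.mlmodelc` is a hypothetical placeholder, and `.cpuAndNeuralEngine` requires a recent OS release; on older systems, `.all` lets Core ML pick the best available hardware on its own.

```swift
import CoreML

// Prefer the Neural Engine for inference; Core ML falls back to the CPU
// for layers the ANE cannot execute. (.cpuAndNeuralEngine needs iOS 16/macOS 13+.)
let configuration = MLModelConfiguration()
configuration.computeUnits = .cpuAndNeuralEngine

do {
    // "StyleTransfer.mlmodelc" is a hypothetical compiled model bundled with the app.
    guard let modelURL = Bundle.main.url(forResource: "StyleTransfer", withExtension: "mlmodelc") else {
        fatalError("Model not found in the app bundle")
    }
    let model = try MLModel(contentsOf: modelURL, configuration: configuration)
    print("Loaded model:", model.modelDescription)
} catch {
    print("Failed to load model:", error)
}
```

Core ML treats the compute-unit setting as a preference rather than a guarantee, which is why the sketch doesn't try to verify where each layer actually ran.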

As Apple's platforms have become more intelligent, the on-device Neural Engine has become faster and faster: from the dual-core Neural Engine in the A11 of iPhone X and iPhone 8, which performed 600 billion operations per second, to the A17 Pro in iPhone 15 Pro, which can run up to 35 trillion operations per second. The Neural Engine is undoubtedly a key piece of hardware infrastructure Apple has built for its "on-device intelligence," which is why it never misses an Apple hardware event and is always part of the processor performance story. The features discussed in this article all rely, to some extent, on the Neural Engine running locally on the device.

The camera defines one of the core experiences of the modern smartphone. Beyond the lens, sensor, and processor, a great deal of on-device intelligence goes into how the iPhone takes a photo: the data from the lens and sensor undergoes a series of computations before it is finally presented to the eye. And once the photo is taken, machine learning is involved in face recognition and categorization, generating Memories, automatically selecting wallpapers, extracting key information, and more.

iPhone Computational Photography

Deep Fusion, originally introduced on iPhone 11, uses a machine learning model to composite up to nine frames, improving low-light photos and reducing noise. It's a key piece of iPhone computational photography and gets updated every year; for example, Deep Fusion in the iPhone 14 lineup gained an optimized image pipeline.

However, Deep Fusion is a system feature that's on by default and can't be turned off. If you use the built-in Camera app, every photo is automatically processed with Deep Fusion, and many users have reported that the results look overly contrasty and over-sharpened.

Apple ProRAW, introduced with iPhone 12 Pro, combines the information of a standard RAW file with the processing of iPhone computational photography, giving users far more editing latitude. On iPhone 14 Pro, users can shoot 48-megapixel ProRAW photos with the new main camera and combine them with the iPhone's machine learning capabilities for even greater detail.
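For developers, ProRAW capture is exposed through AVFoundation. The sketch below is only illustrative: it assumes `photoOutput` is an `AVCapturePhotoOutput` already attached to a configured `AVCaptureSession` and that a capture delegate exists elsewhere in the app, and it shows only the ProRAW-specific opt-in.

```swift
import AVFoundation

// Opt in to Apple ProRAW and request a single ProRAW capture.
// Assumes `photoOutput` already belongs to a running AVCaptureSession and that
// `delegate` implements AVCapturePhotoCaptureDelegate elsewhere in the app.
func captureProRAW(with photoOutput: AVCapturePhotoOutput,
                   delegate: AVCapturePhotoCaptureDelegate) {
    // ProRAW is only offered on supported hardware.
    guard photoOutput.isAppleProRAWSupported else { return }
    photoOutput.isAppleProRAWEnabled = true

    // Pick a pixel format that the output reports as an Apple ProRAW format.
    guard let proRAWFormat = photoOutput.availableRawPhotoPixelFormatTypes.first(where: {
        AVCapturePhotoOutput.isAppleProRAWPixelFormat($0)
    }) else { return }

    let settings = AVCapturePhotoSettings(rawPixelFormatType: proRAWFormat)
    photoOutput.capturePhoto(with: settings, delegate: delegate)
}
```

The capability check comes first because ProRAW is a hardware-dependent feature; on unsupported devices the output simply never lists a ProRAW pixel format.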

Portrait Mode is another iPhone computational photography feature: it highlights the subject and blurs the background using depth information captured by the camera and computed by a machine learning model. On iPhone 15 you no longer need to turn Portrait Mode on manually. Whenever the iPhone detects a person or pet in the frame, the system automatically collects and computes depth data, so you can decide later whether to apply the portrait effect. Although Portrait Mode has been around for years, the machine-learned results can still look unnatural, often blurring out details around the edges of the subject, especially with still-life shots.

Similarly, machine learning on depth information has made its way into video, as exemplified by Cinematic Mode, released with iPhone 13. In short, Apple treats computational photography as a deeply machine-learning-driven area and a key focus of its yearly updates. These technologies are also being applied to other devices: the cameras used with the Mac and Studio Display now rely on Apple Silicon's image signal processor and Neural Engine to improve image quality.

Visual Look Up and Live Text

Visual Look Up is a photo subject recognition feature Apple introduced at WWDC. Once a subject is recognized, you can find relevant photos simply by searching for keywords: type "cell phone" into the search field, for example, and the Photos app automatically lists photos it recognizes as containing a cell phone. Apple has since built on this with the subject extraction feature. Related functionality also shows up on Apple TV and HomePod through HomeKit Secure Video, which recognizes what the HomeKit camera at your doorstep is seeing and sends alert notifications.

Live Text is another feature introduced at WWDC. It recognizes text, web addresses, phone numbers, street addresses, and other information in the camera frame or in images across the system, including images on web pages, photos in albums, screenshots, PDF files in Finder, and more. Live Text can also recognize information in any paused frame of a video. I often use this feature on my Mac, especially when reading PDF files that contain no embedded text. Overall usability is fair, with print recognized more accurately than handwriting. The feature uses machine learning, yet it's also available on Intel-based Macs.

The analysis of photos feeds other parts of the system as well, such as Memories and smart suggestions. These use the scene information Visual Look Up extracts from photos, such as trips, birthday parties, pets, and get-togethers, and automatically edit the photos into short videos with soundtracks. Generating those videos also involves machine learning, which adjusts the effects based on the content of the photos and videos as well as the tempo of the song.
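Third-party apps can get similar on-device text recognition through the Vision framework. The sketch below is only an illustration of that public API, not how Live Text itself is implemented; `receipt.png` is a hypothetical local image.

```swift
import Vision

// Recognize printed text in an image entirely on-device, similar in spirit to Live Text.
let request = VNRecognizeTextRequest { request, error in
    guard let observations = request.results as? [VNRecognizedTextObservation] else { return }
    for observation in observations {
        // Take the most confident transcription candidate for each detected text region.
        if let candidate = observation.topCandidates(1).first {
            print(candidate.string)
        }
    }
}
request.recognitionLevel = .accurate      // favor accuracy over speed
request.usesLanguageCorrection = true     // clean up the raw output with a language model

// "receipt.png" is a hypothetical local image used for illustration.
let handler = VNImageRequestHandler(url: URL(fileURLWithPath: "receipt.png"), options: [:])
do {
    try handler.perform([request])
} catch {
    print("Text recognition failed:", error)
}
```

As with the system feature, accuracy is noticeably better on printed text than on handwriting.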

Entering text is another basic operation on interactive devices. Besides typing directly on the keyboard, text can be entered by dictation or by scanning. Whichever input method you use, intelligence is involved: dictation relies on speech-to-text recognition, keyboard input relies on auto-correction and text prediction, and scanning relies on extracting information from images. At recent WWDCs, Apple has focused on optimizing the text input experience.

Dictation

When Dictation was first built into the iPhone, it had to be processed entirely over the internet; today it runs entirely on the device, can enter emoji as well as text, and lets you keep typing on the keyboard while you dictate. For the most part, the accuracy of the new Dictation is quite good, but when there are many inflected words it can stumble and require manual editing. WWDC 2023 mentioned that a new Transformer model makes Dictation even more accurate. Dictation is an especially important and natural way to enter text on wearable devices like Apple Watch and Apple Vision Pro, where its accuracy goes a long way toward shaping the day-to-day experience.
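Third-party apps don't call Siri Dictation directly, but the Speech framework exposes comparable on-device recognition. The following is a minimal sketch under that assumption; `memo.m4a` is a hypothetical audio file, and on-device support varies by locale and device.

```swift
import Speech

SFSpeechRecognizer.requestAuthorization { status in
    guard status == .authorized,
          let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US")),
          recognizer.supportsOnDeviceRecognition else { return }

    // "memo.m4a" is a hypothetical audio file used for illustration.
    let request = SFSpeechURLRecognitionRequest(url: URL(fileURLWithPath: "memo.m4a"))
    request.requiresOnDeviceRecognition = true   // keep audio and transcription on the device

    _ = recognizer.recognitionTask(with: request) { result, error in
        if let result, result.isFinal {
            print(result.bestTranscription.formattedString)
        }
    }
}
```

Setting `requiresOnDeviceRecognition` trades some accuracy for privacy and offline availability, which mirrors the trade-off Apple makes with on-device Dictation itself.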

Auto Correction and Real-time Input Prediction

For direct keyboard input, WWDC 2023 introduced optimized auto-correction and real-time input prediction. Auto-correction not only fixes words the user may have misspelled, it also infers which key the user actually meant to press (including syllables for glide typing, as well as the full keyboard on Apple Watch). Real-time input prediction pops up or completes words based on the user's personal vocabulary and style; typically it can predict the next word or help finish a long word you're not sure how to spell. In practice, the prediction feature has often "corrected" my phrases in recent releases, and there were times when I wondered whether I had actually made a typing error.

The Apple Machine Learning Research website publishes a good deal of work on Transformer models and other topics, and also reveals background details on techniques already shipping in the system: how to generate text quickly, efficiently, and accurately, how Siri is triggered by speech, multimodal large language models, and more. Perhaps at the next WWDC some of these results will become integrated system features.
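None of the keyboard's Transformer machinery described above is exposed to developers directly, but UIKit's `UITextChecker` gives apps a simpler taste of the same ideas: misspelling detection and word completion. A minimal sketch, with hypothetical sample strings:

```swift
import UIKit

let checker = UITextChecker()
let text = "The weathr is nice today"
let fullRange = NSRange(location: 0, length: (text as NSString).length)

// Find the first misspelled word and ask for correction guesses.
let misspelled = checker.rangeOfMisspelledWord(in: text, range: fullRange,
                                               startingAt: 0, wrap: false, language: "en_US")
if misspelled.location != NSNotFound {
    let guesses = checker.guesses(forWordRange: misspelled, in: text, language: "en_US")
    print("Suggestions:", guesses ?? [])
}

// Complete a partial word, loosely analogous to inline predictions.
let partial = "incre"
let partialRange = NSRange(location: 0, length: (partial as NSString).length)
let completions = checker.completions(forPartialWordRange: partialRange, in: partial, language: "en_US")
print("Completions:", completions ?? [])
```

Unlike the system keyboard, this API doesn't learn a user's personal style; it's a dictionary-driven baseline rather than a language model.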

Released in 2023, Apple Watch Series 9 and Apple Watch Ultra 2 feature the S9 chip with a four-core Neural Engine, which underpins a host of new features on these models, including on-device Siri, dictation, and the double-tap gesture (tapping index finger and thumb together twice). In addition, a number of sports and health features on Apple Watch also involve machine learning, such as exercise detection and sleep stage tracking.

System Features: On-Device Siri, Smart Stack, Gestures

Thanks to the four-core Neural Engine in Apple Watch Series 9, many machine learning tasks can run more efficiently on the watch itself. Siri requests can be handled on the device without an internet connection, so they're more responsive, and questions about a user's health data can be processed and answered locally: in addition to things like weather and timers, you can ask Siri on Apple Watch Series 9 how you slept the night before, what your heart rate was, and more. Apple Watch Series 9 also supports the double-tap gesture, which performs the key action on the current screen, such as answering a call, starting a timer, or showing the Smart Stack. watchOS 10's Smart Stack widget feature likewise uses machine learning to decide which widget should surface at the top of the stack, so every time you turn the Digital Crown you see the most relevant information, such as upcoming meetings or the song currently playing.
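For third-party widgets, the documented way to influence Smart Stack ordering is the relevance score a widget attaches to its timeline entries. Below is a minimal WidgetKit sketch under that assumption; the meeting data is hypothetical, and the actual ranking combines this hint with the system's own signals.

```swift
import Foundation
import WidgetKit

// Timeline entry for a hypothetical "next meeting" widget.
struct MeetingEntry: TimelineEntry {
    let date: Date
    let title: String
    // The relevance score is the hint the Smart Stack can use when ranking widgets.
    let relevance: TimelineEntryRelevance?
}

struct MeetingProvider: TimelineProvider {
    func placeholder(in context: Context) -> MeetingEntry {
        MeetingEntry(date: Date(), title: "Team sync", relevance: nil)
    }

    func getSnapshot(in context: Context, completion: @escaping (MeetingEntry) -> Void) {
        completion(placeholder(in: context))
    }

    func getTimeline(in context: Context, completion: @escaping (Timeline<MeetingEntry>) -> Void) {
        let meetingStart = Date().addingTimeInterval(30 * 60)   // hypothetical meeting in 30 minutes
        let entry = MeetingEntry(
            date: meetingStart,
            title: "Design review",
            // High relevance for the hour around the meeting, so the widget is more
            // likely to rotate to the top of the Smart Stack at that time.
            relevance: TimelineEntryRelevance(score: 80, duration: 60 * 60)
        )
        completion(Timeline(entries: [entry], policy: .atEnd))
    }
}
```

The widget's view and configuration are omitted here; the point is simply that relevance is a per-entry, time-bounded hint rather than a guarantee of placement.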

Sports and Health: Exercise Detection, Sleep Monitoring

In addition to system features, Apple Watch's sports and health functions also depend on machine learning, and when introducing them Apple often describes the related model training. For example, the sleep stages feature in watchOS 9 drew on clinical studies of sleep stages across different populations and runs its machine learning on the device. What's puzzling, though, is that sleep tracking on Apple Watch doesn't seem to detect automatically whether the user has fallen asleep; it's driven by the sleep schedule, so sleep stages are only recorded during the Sleep Focus. Naps, or sleep after the alarm has been turned off, won't be tracked.

Safety Features

iPhone and Apple Watch also include safety features, such as Fall Detection on Apple Watch (available on Apple Watch SE and Apple Watch Series 4 or newer, and something I triggered myself the last time I was at the ice rink). These, too, are built by studying and analyzing the motion involved and distilling it into machine learning models.

Apple devices also offer a number of accessibility features that help people with sensory or motor impairments make better use of modern technology, such as the familiar Magnifier, the VoiceOver screen reader, and using AirPods as hearing aids. Some of these accessibility features also rely on the Neural Engine and on-device intelligence, such as Personal Voice and Sound Recognition.

Personal Voice is an accessibility feature that analyzes 150 recorded phrases from the user to create, entirely on an iPhone or iPad, a synthesized voice that sounds like them. Users can then have the system speak in that voice during FaceTime calls, phone calls, augmentative communication apps, and face-to-face conversations. The feature is currently available in English only.

SoundAnalysis, a framework introduced at WWDC, has built-in classification for more than 300 sound categories that developers can call directly from their apps, letting the system recognize sounds picked up by the microphone. On top of this framework, Apple added Sound Recognition to the accessibility features in iOS: it recognizes sounds in the environment, such as doorbells, sirens, and cats or dogs barking, and can even learn custom sound categories, which is helpful for users with hearing impairments.
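Here is a minimal sketch of the developer-facing side of SoundAnalysis, using the built-in classifier rather than accessibility's Sound Recognition itself. The file `doorbell.wav` is a hypothetical local recording; an app analyzing live microphone input would instead feed buffers from an audio engine into a stream analyzer.

```swift
import SoundAnalysis

// Receives classification results from the analyzer.
final class SoundObserver: NSObject, SNResultsObserving {
    func request(_ request: SNRequest, didProduce result: SNResult) {
        guard let result = result as? SNClassificationResult,
              let best = result.classifications.first else { return }
        print("Heard \(best.identifier) (confidence \(best.confidence))")
    }

    func request(_ request: SNRequest, didFailWithError error: Error) {
        print("Analysis failed:", error)
    }
}

let observer = SoundObserver()

do {
    // The system's built-in classifier; .version1 covers roughly 300 sound labels.
    let request = try SNClassifySoundRequest(classifierIdentifier: .version1)
    // "doorbell.wav" is a hypothetical local recording used for illustration.
    let analyzer = try SNAudioFileAnalyzer(url: URL(fileURLWithPath: "doorbell.wav"))
    try analyzer.add(request, withObserver: observer)
    analyzer.analyze()
} catch {
    print("Could not set up sound analysis:", error)
}
```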

Rumors have been swirling lately about a possible collaboration between Apple and OpenAI, among others, and about WWDC focusing on the AI capabilities of Apple's platforms. It has me thinking about where Apple is headed in this wave of generative AI.

From this article, it's clear that Apple has a deep foundation in machine learning research and applications. But now that large language models seem too big to run directly on a device, will Apple stick to its "on-device intelligence" values? And how will it bring more powerful AI tools to its software platforms? Perhaps we'll have to wait until WWDC 2024 on June 10 to find out.