Machine Learning is, at its core, a way of letting programs learn how to do things by example. It can be used to teach a program how to play Mario, or how to walk. In this tutorial, Firebase’s Machine Learning Kit (commonly known as ML Kit) will be used to retrieve text, faces, barcodes, and objects from an image.

Firebase’s API provides “models” (knowledge based on data sets) for common actions such as identifying faces or objects in images. Firebase also provides the ability to utilise custom models (via TensorFlow), but only the built-in models will be covered in this tutorial. Additionally, only on-device APIs will be utilised, as cloud APIs require paid plans.

This post is part of The Complete Guide to Firebase. Throughout this tutorial, the official documentation may be useful.

Implementation

As always, the entire Firebase Reference Project is open source, and there is a pull request for adding Firebase ML Kit if you just want to see the code changes required. This tutorial assumes you already have Firebase added to your project.

In this tutorial, an image will be picked from a file selector, then analysed using ML Kit for relevant information. The image could also be provided via a video stream or a camera, but using an existing file is the simplest way to demonstrate the core concepts. For each ML Kit model, a raw input image and an analysed screenshot are also provided.

Setting up Firebase ML Kit

First, add the ML Kit library to your app-level build.gradle:

implementation 'com.google.firebase:firebase-ml-vision:18.0.1'

Next, add the models you’ll be using to your AndroidManifest.xml as a meta-data entry. This allows the models to be downloaded automatically when your app is installed, instead of waiting until they are first needed. The following example includes all models used in this tutorial:

<meta-data
    android:name="com.google.firebase.ml.vision.DEPENDENCIES"
    android:value="ocr,face,barcode,label" />

Obtaining a target picture

For the example app, we first define a request code for each of the 4 potential ML Kit actions to be performed (Text, Face, Barcode, or Object). These can be any number up to 65535 (request codes must fit in 16 bits), and are used to distinguish requests:

private val TEXT_RESPONSE = 3331
private val FACE_RESPONSE = 3442
private val BARCODE_RESPONSE = 4443
private val LABEL_RESPONSE = 4143

Next, a button is created for each ML Kit identifying API demonstrated (Text, Face, Barcode, Object). Example XML is available, but the layout doesn’t matter. Each of these buttons then has an onClickListener set, which opens the system’s file picker, filtered to images only:

textButton.setOnClickListener {
    startActivityForResult(
        Intent(Intent.ACTION_GET_CONTENT).setType("image/*"), TEXT_RESPONSE)
}
faceButton.setOnClickListener {
    startActivityForResult(
            Intent(Intent.ACTION_GET_CONTENT).setType("image/*"), FACE_RESPONSE)
}
barcodeButton.setOnClickListener {
    startActivityForResult(
            Intent(Intent.ACTION_GET_CONTENT).setType("image/*"), BARCODE_RESPONSE)
}
objectButton.setOnClickListener {
    startActivityForResult(
            Intent(Intent.ACTION_GET_CONTENT).setType("image/*"), LABEL_RESPONSE)
}

Once the user picks an image, onActivityResult is called. After checking the resultCode, a Uri for the image can be created using Uri.parse(data!!.dataString). This Uri can then be passed to FirebaseVisionImage.fromFilePath(context, uri) to obtain the image in the required format.

This image (after updating the layout’s preview) is then passed to the appropriate function based on the desired ML Kit operation to be performed:

override fun onActivityResult(requestCode: Int, resultCode: Int, data: Intent?) {
    super.onActivityResult(requestCode, resultCode, data)
    if (resultCode != RESULT_OK) return
    val uri = Uri.parse(data!!.dataString)
    val image = FirebaseVisionImage.fromFilePath(activity!!, uri)
    imagePreview.setImageBitmap(image.bitmapForDebugging)
    output.text = ""
    when (requestCode) {
        TEXT_RESPONSE -> retrieveText(image)
        FACE_RESPONSE -> retrieveFace(image)
        BARCODE_RESPONSE -> retrieveBarcode(image)
        LABEL_RESPONSE -> retrieveLabels(image)
    }
}

Using ML Kit on the target picture

For all of the following code examples, image is the FirebaseVisionImage passed from onActivityResult.

Retrieving text from an image

First, make sure ocr is in your DEPENDENCIES meta-data in your AndroidManifest.xml.

Next, calling firebase.onDeviceTextRecognizer.processImage(image) performs an OCR task using ML Kit. The attached success listener receives a FirebaseVisionText object. This object (and every text object nested below it) has a .text property that provides the extracted text. Additionally, all child objects have boundingBox / cornerPoints (the area scanned for this object), confidence (certainty in the text result), and recognizedLanguages (the identified languages).

Inside this overall summary object, there are blocks containing lines containing words containing letters. The following function loops through every word to create a count of all blocks, lines, words, and letters.

private fun retrieveText(image: FirebaseVisionImage) {
    FirebaseVision.getInstance()
            .onDeviceTextRecognizer
            .processImage(image)
            .addOnSuccessListener { textObject ->
                var blocks = 0
                var lines = 0
                var words = 0
                val letters = textObject.textBlocks.sumBy { block ->
                    blocks += 1
                    block.lines.sumBy { line ->
                        lines += 1
                        line.elements.sumBy { word ->
                            words += 1
                            word.text.length
                        }
                    }
                }
                output.text = String.format(getString(R.string.mlkit_text_data),
                        blocks, lines, words, letters, textObject.text)
            }
            .addOnFailureListener { output.text = it.localizedMessage }
}
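
The boundingBox, confidence, and recognizedLanguages properties mentioned earlier aren’t used in the example above, but can be read from each block. Here’s a minimal sketch (logBlockMetadata is a hypothetical helper, not part of the reference project):

import android.util.Log
import com.google.firebase.ml.vision.text.FirebaseVisionText

// Hypothetical helper: logs the per-block metadata described above.
// Note: confidence may be null when using the on-device recognizer.
private fun logBlockMetadata(textObject: FirebaseVisionText) {
    textObject.textBlocks.forEach { block ->
        Log.d("MLKit", "Block \"${block.text}\" at ${block.boundingBox} " +
                "(confidence: ${block.confidence}, " +
                "languages: ${block.recognizedLanguages.map { it.languageCode }})")
    }
}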
(Input / output example images)

Detecting faces in an image

First, make sure face is included in your DEPENDENCIES meta-data inside AndroidManifest.xml and the face detection dependency has been added alongside core ML Kit:

implementation 'com.google.firebase:firebase-ml-vision:18.0.1'
implementation 'com.google.firebase:firebase-ml-vision-face-model:17.0.2'

Next, a FirebaseVisionFaceDetectorOptions needs to be built. There are 5 toggleable options within:

  • setPerformanceMode() can be FAST (default) or ACCURATE. Accurate is useful for static image analysis.
  • setLandmarkMode() can be NO_LANDMARKS (default) or ALL_LANDMARKS. This feature is useful when the positions of specific facial features are needed.
  • setContourMode() can be NO_CONTOURS (default) or ALL_CONTOURS. This feature is useful for getting the outlines of specific facial features.
  • setClassificationMode() can be NO_CLASSIFICATIONS (default) or ALL_CLASSIFICATIONS. This feature is useful when information such as likelihood of smiling / eyes being open is needed.
  • enableTracking() enables assigning IDs to faces. When enabled, this identifies users across multiple images.

Contours and face tracking cannot currently both be enabled at the same time. As such, for this example we’ll enable everything except contours:

val options = FirebaseVisionFaceDetectorOptions.Builder()
        .setPerformanceMode(FirebaseVisionFaceDetectorOptions.ACCURATE)
        .setLandmarkMode(FirebaseVisionFaceDetectorOptions.ALL_LANDMARKS)
        .setClassificationMode(FirebaseVisionFaceDetectorOptions.ALL_CLASSIFICATIONS)
        .enableTracking()
        .build()

These options are then passed to firebase.getVisionFaceDetector(options) and .detectInImage(image) is called. A (potentially empty) list of faces is returned, each of which contains information on the face’s ID, head rotation, facial feature positions, etc. Probabilities (eyes open, smiling) are given between 0 and 1, so multiply by 100 to get a confidence percentage. headEulerAngleY refers to the angle the head is looking left or right: 0 degrees is looking directly at the camera, negative values are looking to the left, and positive values to the right (from the detected face’s perspective).

FirebaseVision.getInstance()
        .getVisionFaceDetector(options)
        .detectInImage(image)
        .addOnSuccessListener { faces ->
            if (faces.isEmpty()) {
                output.text = getString(R.string.mlkit_no_faces)
            } else {
                var text = ""
                faces.forEach {
                    text += String.format(getString(R.string.mlkit_face_data),
                            it.trackingId,
                            it.leftEyeOpenProbability * 100,
                            it.rightEyeOpenProbability * 100,
                            it.smilingProbability * 100,
                            it.headEulerAngleY,
                            it.headEulerAngleZ)
                }
                output.text = text
            }
        }
        .addOnFailureListener { output.text = it.localizedMessage }
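
Because ALL_LANDMARKS was requested in the options above, individual facial feature positions can also be read from each face. A minimal sketch (logEyePositions is a hypothetical helper, not part of the reference project):

import android.util.Log
import com.google.firebase.ml.vision.face.FirebaseVisionFace
import com.google.firebase.ml.vision.face.FirebaseVisionFaceLandmark

// Hypothetical helper: logs the positions of both eyes for a detected face.
private fun logEyePositions(face: FirebaseVisionFace) {
    // getLandmark() returns null if that landmark wasn't found in the image
    face.getLandmark(FirebaseVisionFaceLandmark.LEFT_EYE)?.position?.let {
        Log.d("MLKit", "Left eye at (${it.x}, ${it.y})")
    }
    face.getLandmark(FirebaseVisionFaceLandmark.RIGHT_EYE)?.position?.let {
        Log.d("MLKit", "Right eye at (${it.x}, ${it.y})")
    }
}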
(Input / output example images)

Identifying barcodes in an image

As usual, make sure barcode is in the android:value of your DEPENDENCIES in AndroidManifest.xml:

<meta-data
    android:name="com.google.firebase.ml.vision.DEPENDENCIES"
    android:value="barcode" />

Then, create a FirebaseVisionBarcodeDetectorOptions object listing all the barcode formats you want to detect. QR Code, UPC, and EAN are the three most common formats.

val options = FirebaseVisionBarcodeDetectorOptions.Builder()
        .setBarcodeFormats(
                FirebaseVisionBarcode.FORMAT_QR_CODE,
                FirebaseVisionBarcode.FORMAT_AZTEC,
                FirebaseVisionBarcode.FORMAT_UPC_A,
                FirebaseVisionBarcode.FORMAT_UPC_E,
                FirebaseVisionBarcode.FORMAT_EAN_13)
        .build()

Next, pass this to firebase.getVisionBarcodeDetector(options) and call detectInImage(image). This returns a list of detected barcodes, each of which contains the raw value, a parsed value, and the value type. For example, a TYPE_GEO QR code will have a rawValue of geo:40.1234,75.1234,100, a valueType of TYPE_GEO, and a geoPoint object with lat and lng doubles. Using the raw value is generally safest, as all value types resolve to sensible plain text.

FirebaseVision.getInstance()
        .getVisionBarcodeDetector(options)
        .detectInImage(image)
        .addOnSuccessListener { barcodes ->
            if (barcodes.isEmpty()) {
                output.text = getString(R.string.mlkit_no_barcode)
            } else {
                var string = ""
                barcodes.forEach {
                    string += String.format(getString(R.string.mlkit_barcode_data),
                            it.rawValue,
                            getBarcodeType(it.valueType))
                }
                output.text = string
            }
        }
        .addOnFailureListener { output.text = it.localizedMessage }
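
The getBarcodeType() helper used above isn’t shown here; it simply maps the valueType constant to a readable name. A minimal sketch of what it could look like (the reference project’s actual implementation may differ):

import com.google.firebase.ml.vision.barcode.FirebaseVisionBarcode

// One possible mapping from valueType constants to human-readable labels.
private fun getBarcodeType(valueType: Int) = when (valueType) {
    FirebaseVisionBarcode.TYPE_URL -> "URL"
    FirebaseVisionBarcode.TYPE_GEO -> "Location"
    FirebaseVisionBarcode.TYPE_WIFI -> "WiFi details"
    FirebaseVisionBarcode.TYPE_CONTACT_INFO -> "Contact details"
    FirebaseVisionBarcode.TYPE_PRODUCT -> "Product"
    FirebaseVisionBarcode.TYPE_TEXT -> "Plain text"
    else -> "Other"
}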
(Input / output example images)

Labelling all objects in an image

The on-device version of this API recognises 400+ of the most common labels, whereas the paid cloud API covers 10,000+. However, the local API is still very powerful, and easily detects surroundings and objects.

As always, first add label to your DEPENDENCIES inside your AndroidManifest.xml. Then add the image labelling library to your app-level build.gradle:

implementation 'com.google.firebase:firebase-ml-vision:18.0.1'
implementation 'com.google.firebase:firebase-ml-vision-image-label-model:17.0.2'

The only option for FirebaseVisionLabelDetector is the minimum confidence threshold (0 to 1) a result must reach to be returned; it won’t be used in this example, but a sketch of how to set it follows the code below. Calling firebase.visionLabelDetector.detectInImage(image) will return a list of FirebaseVisionLabel objects. Each of these contains a label (the detected object’s name), confidence (how confident the model is in that label), and an entityId (a unique ID for the object).

private fun retrieveLabels(image: FirebaseVisionImage) {
    FirebaseVision.getInstance()
            .visionLabelDetector
            .detectInImage(image)
            .addOnSuccessListener { labels ->
                if (labels.isEmpty()) {
                    output.text = getString(R.string.mlkit_no_label)
                } else {
                    var string = ""
                    labels.forEach {
                        string += String.format(getString(R.string.mlkit_label_data),
                                it.label, it.confidence * 100)
                    }
                    output.text = string
                }
            }
            .addOnFailureListener { output.text = it.localizedMessage }
}
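
For reference, the confidence threshold mentioned earlier could be applied by building a FirebaseVisionLabelDetectorOptions and passing it to getVisionLabelDetector. A minimal sketch (not used in the example app; the 0.7 threshold is an arbitrary value):

import com.google.firebase.ml.vision.FirebaseVision
import com.google.firebase.ml.vision.label.FirebaseVisionLabelDetectorOptions

// Only return labels the model is at least 70% confident about (arbitrary example value)
val labelOptions = FirebaseVisionLabelDetectorOptions.Builder()
        .setConfidenceThreshold(0.7f)
        .build()
val detector = FirebaseVision.getInstance().getVisionLabelDetector(labelOptions)

detector.detectInImage(image) is then used exactly as in the example above.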
(Input / output example images)

Web interface

The APIs tab of the web interface shows all APIs currently in use by your apps, as well as which package name is utilising each one.

Conclusion

Firebase’s ML Kit provides an excellent introduction to the power of machine learning assisted image analysis. Whilst the OCR, object identification, barcode reading, and facial recognition shown here are just examples, the built-in models are already extremely powerful. Considering that these models are free and work offline, they are suitable for almost any app.

However, the black-box nature of this kind of image analysis means that there is little debug information available when a detection goes wrong. The high success rate of the models should prevent this from being an issue, but the lack of insight is worth keeping in mind.

From my own experimentation, the facial analysis performed amazingly well, as did the object recognition. However, somewhat surprisingly, the text identification was much less successful. This is likely due to the large number of fonts available, as well as the similarity between many letters, but it’s still unusual that the most common use case is the least accurate!

Previous: Developing Android Apps With Firebase Cloud Storage

Next: Ensuring Your Android App’s Quality With Firebase Crashlytics