
    June 2, 2022

    Federated Learning 101

    One of the ideas you may run across in data privacy circles is federated learning. If you're not familiar with it, federated learning is a type of machine learning. It's appealing for privacy reasons because with federated learning, the data it's working with isn't sent to a centralized server, but rather stays on the local device. Let's start by looking at how machine learning works in a little more depth.

    Machine learning is a data analytics technique that teaches computers to "learn" from experience in order to make predictions. As a comparison, think about how we teach children language. To start, we typically show them something and then say the word for that thing - and we keep repeating this until the child understands the characteristics that identify the item.

    a photo of some cats sitting in a line
    I never turn down a chance to post pictures of cats

    For example, we'd teach a child the word "cat" by pointing out different cats and identifying each of them as a cat. When the child pointed at a dog and said the word "cat", we'd say "no, that's a dog" and keep this process up until the child could reliably identify cats. What the child is doing there is pretty amazing when you think about it, because they're picking out the characteristics of a cat - like 4 legs, 2 eyes, 2 ears, whiskers, fur, a certain size, etc. - and grouping them together to define what a cat is.

    This same type of thing is what we're doing with machine learning. We feed large amounts of data to a machine learning model so the computer can imitate how humans learn (without our having to explicitly program the computer). In the traditional setup for machine learning, the data is collected together in a central location, and then cleaned and prepared before being processed. The model is also trained and refined in that central environment. This is called Centralized Machine Learning, and if the data involved was being collected from laptops and mobile devices, it would look something like the diagram below.

    diagram of centralized machine learning
    Centralized machine learning diagram

    It is also possible for machine learning to happen in a decentralized environment - and not surprisingly, this is called Decentralized Machine Learning. What happens here is that instead of the data coming to the machine learning model, the model goes to the data. This system shares the same privacy benefit as federated learning, with the data never leaving the distributed devices. However, in this setup, these "edge" devices aren't learning from one another either.

    diagram of decentralized machine learning
    Decentralized machine learning diagram

    With federated learning, the central platform sends the initial model to the distributed devices. The devices use their local data to train the model, and then send their updated version of the model back to the central platform. The updated models from all of the distributed devices are aggregated together, and the central platform then sends out an updated version of the central model. This cycle keeps repeating, allowing the distributed devices to learn from one another as well.
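
    To make that cycle a little more concrete, here is a minimal sketch in Python. It's only an illustration under some assumptions of my own: the "model" is a single number w in y = w * x, the devices and their data are made up, and the aggregation step is a weighted average of the device models (an approach often called federated averaging). Real systems may aggregate differently, and the function names here are hypothetical.

```python
# Hypothetical sketch of the federated learning cycle on a tiny model y = w * x.
# The data, names, and the weighted-average aggregation are all assumptions.

def local_train(w, data, lr=0.01, epochs=5):
    """Train on one device's private data; only the updated w leaves the device."""
    for _ in range(epochs):
        # Gradient of the mean squared error for the model y = w * x
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_round(global_w, devices):
    """One cycle: send the model out, train locally, aggregate the results."""
    updates = [(local_train(global_w, data), len(data)) for data in devices]
    total = sum(n for _, n in updates)
    # Weighted average: devices with more examples count for more
    return sum(w * n for w, n in updates) / total


# Three simulated devices, each holding private samples of roughly y = 3x
devices = [
    [(1.0, 3.1), (2.0, 5.9)],
    [(1.5, 4.4), (3.0, 9.2), (0.5, 1.6)],
    [(2.5, 7.4)],
]

global_w = 0.0
for _ in range(50):
    global_w = federated_round(global_w, devices)

print(round(global_w, 2))  # should land close to 3
```

    Notice that federated_round never sees the raw (x, y) pairs directly in the aggregation step. It only works with the trained value of w and a count of examples from each device, which is the privacy appeal described above.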

    diagram of federated learning
    Federated learning diagram

    This might seem like a perfect solution to the privacy problem, but there are still some challenges. With Centralized Machine Learning, there is more control of the data - and how it needs to be cleaned and prepared - and of the systems involved. This matters because, as the saying goes, "garbage in, garbage out". A fairly well-known real-world example of this problem resulted in Amazon having to scrap a machine learning system it had developed to evaluate resumes to help with its hiring process. The idea was the system would evaluate the resumes, and identify the top candidates for Amazon to hire. The problem was the system turned out to be biased against women. Why? Because the data used to train the model was the previous 10 years of resumes Amazon had received. Since the tech industry is still predominantly male, the resumes came mainly from men, and the ML model then erroneously determined that being male was a winning characteristic.

    Another challenge can be the diversity and quality of the data and of the edge devices themselves. This is referred to as the heterogeneity problem in federated learning. To understand the data issue, let's say you had a federated learning system that was using the images people store on their phones to learn something about red maple trees. First, not everyone takes photos of trees. Some people predominantly use their phones to take selfies, pictures of their food, or if they're like me, snapshots of their cats. Second, even if they do take nature photos, where they live will determine if there are ever any red maples in their photos (since red maple trees are only native to certain areas). Next you can add in the quality of the photos too. Maybe it's an older phone that has some limitations or maybe the person involved is just a dubious photographer.
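
    Here's a small sketch of what that uneven data could look like. Everything in it is made up for illustration: the phone names, the photo labels, and the counts are hypothetical, and a real system wouldn't have tidy labels like these to begin with.

```python
from collections import Counter

# Hypothetical photo labels on four phones (made-up data for illustration)
phones = {
    "selfie_fan": ["selfie"] * 40 + ["food"] * 10,
    "cat_person": ["cat"] * 55 + ["red_maple"] * 1,
    "hiker_in_maple_country": ["red_maple"] * 20 + ["landscape"] * 30,
    "hiker_elsewhere": ["pine"] * 25 + ["landscape"] * 25,
}


def maple_share(labels):
    """Fraction of a phone's photos that show a red maple."""
    return Counter(labels)["red_maple"] / len(labels)


shares = {name: maple_share(labels) for name, labels in phones.items()}
# Only phones with at least one red maple photo have anything to teach the model
useful = [name for name, share in shares.items() if share > 0]

for name, share in shares.items():
    print(f"{name}: {share:.0%} red maple photos")
print("Phones with any red maple photos:", useful)
```

    In this made-up example, only two of the four phones have anything to contribute, and one of those has a single photo. The model that comes back from each phone reflects that phone's own lopsided collection.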

    Beyond the data involved, there's also the diversity of the devices themselves - and their ability to communicate with the central platform. Newer devices are likely to have more processing power than older ones. People who live in areas with spotty wifi/cellular coverage might have devices that can't routinely send in their updated models or receive them in return.

    Finally, even if all of these other issues were non-existent, there's still a privacy concern involved. While the data itself isn't being sent anywhere, it is still possible for the model that is shared to leak information about the data it was trained on.

    This isn't to say that federated learning isn't valuable. It just means that federated learning, like so many things in life, is a bit more complicated than it might appear at first. It's a very interesting field, though, and I hope you enjoyed reading about it here. If you'd like to learn more about machine learning, you might like to watch this MathWorks video. Or if you're really interested, Stanford University offers an online course on Machine Learning through Coursera. I haven't taken it myself, but I'm thinking about it!

    Thanks for reading! Kris

© 2009-. Kristen Chapman. All Rights Reserved.