Machine Learning Fairness: Lessons Learned (Google I/O’19)

[MUSIC PLAYING] JACQUELINE PAN: Hi, everyone. I'm Jackie, and I'm the Lead
Program Manager on ML Fairness here at Google. So what is ML fairness? As some of you may
know, Google's mission is to organize the
world's information and make it universally
accessible and useful. Every one of our users
gives us their trust. And it's our responsibility
to do right by them. And as the impact
and reach of AI has grown across
societies and sectors, it's critical to ethically
design and deploy these systems in a
fair and inclusive way.

Addressing fairness in
AI is an active area of research at
Google, from fostering a diverse and inclusive
workforce that embodies critical and diverse
knowledge to training models to remove or correct
problematic biases. There is no standard
definition of fairness, whether decisions are made
by humans or by machines. Far from a solved
problem, fairness in AI presents both an
opportunity and a challenge. Last summer, Google
outlined principles to guide the responsible
development and use of AI. One of them directly
speaks to ML fairness and making sure that our
technologies don't create or reinforce unfair bias. The principles
further state that we seek to avoid unjust
impacts on people related to sensitive characteristics
such as race, ethnicity, gender, nationality, income,
sexual orientation, ability, and political or
religious belief. Now let's take a look at how
unfair bias might be created or reinforced. An important step
on that path is acknowledging that humans are
at the center of technology design, in addition to
being impacted by it. And humans have not always
made product design decisions that are in line with
the needs of everyone.

For example, because female
body-type crash test dummies weren't required until
2011, female drivers were more likely
than male drivers to be severely injured
in an accident. Band-Aids have long been
manufactured in a single color– a soft pink. In this tweet, you see
the personal experience of an individual using a
Band-Aid that matches his skin tone for the first time. A product that's designed and
intended for widespread use shouldn't fail for an
individual because of something that they can't change
about themselves. Products and technology
should just work for everyone. These choices may not
have been deliberate, but they still
reinforce the importance of being thoughtful
about technology design and the impact
it may have on humans. Why does Google care
about these problems? Well, our users are
diverse, and it's important that we provide an
experience that works equally well across all of our users. The good news is that
humans, you, have the power to approach these
problems differently, and to create technology
that is fair and more inclusive for more people. I'll give you a sense
of what that means.

Take a look at these images. You'll notice where the
label "wedding" was applied to the images on the left,
and where it wasn't, the image on the right. The labels in these
photos demonstrate how one open source image
classifier trained on the Open Images Dataset does not properly
recognize wedding traditions from different
parts of the world. Open datasets, like open images,
are a necessary and critical part of developing
useful ML models, but some open
source datasets have been found to be geographically
skewed based on how and where they were collected. To bring greater
geographic diversity to open images, last year, we
enabled the global community of crowdsourced app users
to photograph the world around them and make their
photos available to researchers and developers as a part of the
Open Images Extended Dataset. We know that this is just an
early step on a long journey. And to build
inclusive ML products, training data must
represent global diversity along several dimensions. These are complex
sociotechnical challenges, and they need to be interrogated
from many different angles.

It's about problem
formation and how you think about these systems
with human impact in mind. Let's talk a little bit
more about these challenges and where they can
manifest in an ML pipeline. Unfairness can enter the
system at any point in the ML pipeline, from data collection
and handling to model training to end use. Rarely can you identify a single
cause of or a single solution to these problems. Far more often, various
causes interact in ML systems to produce problematic outcomes. And a range of
solutions is needed. We try to disentangle
these interactions to identify root causes
and to find ways forward. This approach spans more than
just one team or discipline. ML fairness is an initiative to
help address these challenges. And it takes a lot of
different individuals with different
backgrounds to do this. We need to ask ourselves
questions like, how do people feel about fairness when they're
interacting with an ML system? How can you make systems
more transparent to users? And what's the societal
impact of an ML system? Bias problems run deep,
and they don't always manifest in the same way.

As a result, we've had to
learn different techniques of addressing these challenges. Now we'll walk through some
of the lessons that Google has learned in evaluating
and improving our products, as well as tools and
techniques that we're developing in this space. Here to tell you more
about this is Tulsee. TULSEE DOSHI: Awesome. Thanks, Jackie. Hi, everyone. My name is Tulsee, and I lead
product for the ML Fairness effort here at Google. Today, I'll talk about three
different angles in which we've thought about and
acted on fairness concerns in our
products, and the lessons that we've learned from that.

We'll also walk through our next
steps, tools, and techniques that we're developing. Of course, we know
that the lessons we're going to talk
about today are only some of the many ways
of tackling the problem. In fact, as you heard in
the keynote on Tuesday, we're continuing to
develop new methods, such as [INAUDIBLE], to
understand our models and to improve them. And we hope to keep
learning with you. So with that, let's
start with data. As Jackie mentioned,
datasets are a key part of the ML development process. Data trains a model
and informs what a model learns from and sees. Data is also a critical part
of evaluating the model. The datasets we choose to
evaluate on indicate what we know about how the
model performs, and when it performs
well or doesn't.

So let's start with an example. What you see on the screen
here is a screenshot from a game called
Quick Draw that was developed through the
Google AI Experiments program. In this game, people drew
images of different objects around the world, like
shoes or trees or cars. And we use those images to train
an image classification model. This model could
then play a game with the users, where a
user would draw an image and the model would guess
what that image was of. Here you see a whole bunch
of drawings of shoes. And actually, we
were really excited, because what better way
to get diverse input from a whole bunch of users than
to launch something globally where a whole bunch of
users across the world could draw images for what
they perceived an object to look like? But what we found as this
model started to collect data was that most of the
images that users drew of shoes looked like
that shoe in the top right, the blue shoe.

So over time, as the model
saw more and more examples, it started to learn that a
shoe looked a certain way like that top right
shoe, and wasn't able to recognize the
shoe in the bottom right, the orange shoe. Even though we were
able to get data from a diverse set
of users, the shoes that the users chose
to draw or the users who actually engaged
with the product at all were skewed, and led
to skewed training data in what we actually received. This is a social
issue first, which is then exacerbated by our
technical implementation. Because when we're making
classification decisions that divide up the world into
parts, even if those parts are what is a shoe and
what isn't a shoe, we're making fundamental
judgment calls about what deserves
to be in one part or what deserves
to be in the other.

It's easier to deal with when
we're talking about shoes, but it's harder to
talk about when we're classifying images of people. An example of this is
the Google Clips camera. This camera was designed to
recognize memorable moments in real-time streaming video. The idea is that
it automatically captures memorable motion photos
of friends, of family, or even of pets. And we designed the
Google Clips camera to have equitable
outcomes for all users.

It, like all of our
camera products, should work for all families,
no matter who or where they are. It should work for
people of all skin tones, all age ranges,
and in all poses, and in all lighting conditions. As we started to
build this system, we realized that if we only
created training data that represented certain
types of families, the model would
also only recognize certain types of families. So we had to do a lot of work
to increase our training data's coverage and to make sure that
it would recognize everyone. We went global to collect these
datasets, collecting datasets of different types of families
in different environments and in different
lighting conditions.

And in doing so, we
were able to make sure that not only could
we train a model that had diverse outcomes,
but that we could also evaluate it
across a whole bunch of different variables
like lighting or space.
we're continuing to do, continuing to create automatic
fairness tests for our systems so that we can see how
they change over time and to continue to ensure that
they are inclusive of everyone.

The biggest lesson we've
learned in this process is how important it is to
build training and evaluation datasets that represent all
the nuances of our target population. This both means making sure
that the data that we collect is diverse and
representative, but also that the different
contexts of the way that the users are
providing us this data is taken into account. Even if you have a
diverse set of users, that doesn't mean that the
images of shoes you get will be diverse. And so thinking about those
nuances and the trade-offs that might occur when you're
collecting your data is super important. Additionally, it's also
important to reflect on who that target
population might leave out. Who might not actually have
access to this product? Where are the blind spots
in who we're reaching? And lastly, how will the
data that you're collecting grow and change over time? As our users use our
products, they very rarely use them in exactly the
way we anticipated them to. And so what happens is the
way that we collect data, or the data that we even
need to be collecting, changes over time.

And it's important that
our collection methods and our maintenance
methods are equally diverse as that initial process. But even if you have a perfectly
balanced, wonderful training dataset, that
doesn't necessarily imply that the output of your
model will be perfectly fair. Also, it can be hard to collect
completely diverse datasets at the start of a process. And you don't always know
what it is that you're missing from the beginning. Where are your blind spots
in what you're trying to do? Because of that, it's
always important to test, to test and measure these issues
at scale for individual groups, so that we can actually
identify where our model may not be performing as
well, and where we might want to think about
more principled improvements. The benefit of
measurement is also that you can start tracking
these changes over time. You can understand
how the model works. Similar to the way
that you would always want to have metrics for
your model as a whole, it's important to think about
how you slice those metrics, and how you can provide yourself
a holistic understanding of how this model or system
works for everybody.

What's interesting is that
different fairness concerns may require different metrics,
even within the same product experience. A disproportionate
performance problem is when, for example, a model
works well for one group, but may not work as
well for another. For example, you
could have a model that doesn't recognize some
subset of users or errors more for that subset of users. In contrast, a
representational harm problem is when a model showcases
an offensive stereotype or harmful association.

Maybe this doesn't
necessarily happen at scale. But even a single instance
can be hurtful and harmful to a set of users. And this requires
a different way of stress-testing the system. Here's an example where both
of those metrics may apply. The screenshot you see is from
our Jigsaw Perspective API. This API is designed to
detect hate and harassment in the context of
online conversations. The idea is, given a
particular sentence, we can classify whether
or not that sentence is perceived likely to be toxic. We have this API externally. So our users can actually write
sentences and give us feedback. And what we found
was one of our users articulated a particular
example that you see here. The sentence, "I am straight,"
is given a score of 0.04, and is classified as "unlikely
to be perceived as toxic." Whereas the
sentence, "I am gay," was given a score
of 0.86, and was classified as "likely to
be perceived as toxic." Both of these are innocuous
identity statements, but one was given a
significantly higher score.

This is something we would never
want to see in our products. We not only wanted
example, but we actually wanted to understand and
quantify these issues to ensure that we could
tackle this appropriately. The first thing we looked
at was this concept of "representational
harm," understanding these counterfactual
differences. For a particular sentence,
we would want the sentence to be classified the
same way regardless of the identity referenced
in the sentence. Whether it's, "I am
Muslim," "I am Jewish," or "I am Christian,"
you would expect the score perceived by the
classifier to be the same.
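
The counterfactual check described here can be sketched in a few lines: substitute different identity terms into the same template sentence and compare scores. The scoring function below is a deliberately biased toy stand-in, not the real Perspective API; the term list and threshold are illustrative assumptions.

```python
# Counterfactual fairness check: score the same template sentence with
# different identity terms and flag large score gaps.
def score_toxicity(sentence: str) -> float:
    """Toy stand-in for a real classifier; scores mimic the bias in the talk."""
    biased = {"gay": 0.86, "straight": 0.04}
    for term, score in biased.items():
        if term in sentence:
            return score
    return 0.10  # default score for other sentences

def counterfactual_scores(template: str, identity_terms: list[str]) -> dict[str, float]:
    """Score the template with each identity term substituted in."""
    return {term: score_toxicity(template.format(identity=term))
            for term in identity_terms}

scores = counterfactual_scores("I am {identity}.", ["straight", "gay", "tall"])
gap = max(scores.values()) - min(scores.values())
print(scores)       # per-identity scores for the same innocuous statement
print(gap > 0.25)   # a wide gap flags a representational-harm problem
```

In a real pipeline, `score_toxicity` would call the trained model, and the template and term lists would be much larger.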

Being able to
provide these scores allowed us to understand
how the system performed. It allowed us to identify
places where our model might be more likely to be
biased, and allowed us to go in and actually
understand those concerns more deeply. But we also wanted to
understand overall error rates for particular groups. Were there particular
identities where, when referenced in
comments, we were more likely to have
errors versus others? This is where the
disproportionate performance question comes in. We wanted to develop
metrics on average for a particular
identity term that showcased, across
a set of comments, whether or not we were
more likely to misclassify. This was in both directions–
misclassifying something as toxic, but also
misclassifying something as not toxic when it truly was
a harmful statement. The three metrics you see
here capture different ways of looking at that problem. And the darker the color,
the darker the purple, the higher the error rate.
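
As a rough sketch of measuring disproportionate performance in both directions, the snippet below computes per-identity false positive and false negative rates from labeled predictions. The example rows are made up for illustration; they are not real Perspective data.

```python
from collections import defaultdict

# Rows of (identity referenced, truly toxic?, predicted toxic?).
examples = [
    ("gay",      False, True),   # innocuous comment misflagged as toxic
    ("gay",      False, True),
    ("gay",      True,  True),
    ("straight", False, False),
    ("straight", True,  True),
    ("straight", True,  False),  # harmful comment missed
]

def slice_error_rates(rows):
    """Compute false positive / false negative rates per identity slice."""
    counts = defaultdict(lambda: {"fp": 0, "fn": 0, "neg": 0, "pos": 0})
    for identity, truth, pred in rows:
        c = counts[identity]
        if truth:
            c["pos"] += 1
            if not pred:
                c["fn"] += 1
        else:
            c["neg"] += 1
            if pred:
                c["fp"] += 1
    return {identity: {
                "false_positive_rate": c["fp"] / c["neg"] if c["neg"] else 0.0,
                "false_negative_rate": c["fn"] / c["pos"] if c["pos"] else 0.0}
            for identity, c in counts.items()}

rates = slice_error_rates(examples)
print(rates["gay"]["false_positive_rate"])       # 1.0: every non-toxic comment misflagged
print(rates["straight"]["false_positive_rate"])  # 0.0
```

Tracking these per-slice rates over model versions is what lets you see whether the disparities are shrinking.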

And you can see that in the
first version of this model, there were huge disparities
between different groups. So OK, we were able to
measure the problem. But then how do we improve it? How do we make sure
this doesn't happen? A lot of research
has been published in the last few years,
both internally within Google as
well as externally, that looks at how to
train and improve our models in a way that still
allows them to be stable, to be resource-efficient,
and to be accurate, so that we can still deploy
them in production use cases.

These approaches balance the
simplicity of implementation with the required accuracy and
quality that we would want. The simplest way to
think about this problem would be through the idea
of removals or block lists, taking steps to ensure that your
model can't access information in a way that could
lead to skewed outcomes. Take, for example, the sentence,
"Some people are Indian." We may actually want to remove
that identity term altogether, and replace it with a more
generic tag, "identity." If you do this for every
single identity term, your model wouldn't even have
access to identity information. It would simply know that
the sentence referenced an identity.

As a result, it couldn't
make different decisions for different identities
or different user groups. This is a great way to make
sure that your model is agnostic of a particular
definition of an individual. At the same time,
it can be harmful. It actually might be
useful in certain cases to know when identity terms
are used in a way that is offensive or harmful. If a particular
term is often used in a negative or
derogatory context, we would want to
know that, so we could classify that as toxic. Sometimes, this context is
actually really important. But it's important
that we capture it in a nuanced and contextual way.

Another way to think
about it is to go back to that first lesson, and
look back at the data. We can enable our
models to sample data from areas in which the model
seems to be underperforming. We could do this both manually
as well as algorithmically. On the manual side,
what you see on the right is a quote collected
through Google's Project Respect effort.

Through Project Respect,
we went globally to collect more
and more comments of positive representations
of identity. This comment is
from a pride parade, where someone from Lithuania
talks about their gay friends, and how they're brilliant
and amazing people. Positive reflections of identity
are great examples for us to train our model, and to
support the model in developing a context and nuanced
understanding of comments, especially when the
model is usually trained from online
comments that may not always have the same flavor.

We can also enable the model
to do this algorithmically through active sampling. The model can
identify the places where it has the least
confidence in its decision making, where it might
be underperforming. And it can actively
go out and sample more from the training dataset that
represents that type of data. We can continue to even
build more and more examples through synthetic examples. Similar to what you
saw at the beginning, we can create these short
sentences, like "I am," "He is," "My friends are." And these sentences can
continue to provide the model understandings of when identity
can be used in natural context. We can even make changes
directly to our models by updating the models'
loss functions to minimize difference in performance
between different groups of individuals.
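
One toy way to picture such a loss-function change is to add a penalty on the gap between the model's average scores for comparable examples from two groups. This is a drastic simplification of the real research methods; the numbers and weight below are purely illustrative.

```python
# Sketch of a group-gap penalty added to a task loss: non-toxic comments
# mentioning group A are being scored much higher than those mentioning
# group B, and the extra term penalizes that gap during training.
def gap_penalized_loss(task_loss, scores_group_a, scores_group_b, weight=1.0):
    mean_a = sum(scores_group_a) / len(scores_group_a)
    mean_b = sum(scores_group_b) / len(scores_group_b)
    return task_loss + weight * abs(mean_a - mean_b)

loss = gap_penalized_loss(task_loss=0.30,
                          scores_group_a=[0.8, 0.9],   # mean 0.85: over-flagged
                          scores_group_b=[0.1, 0.2],   # mean 0.15
                          weight=0.5)
print(loss)  # roughly 0.65 = 0.30 + 0.5 * |0.85 - 0.15|
```

Real implementations match whole score distributions rather than means, and are engineered to keep training stable and efficient.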

Adversarial training
and min diff loss, two of the research
methods in this space, have actively looked at how
to adjust your loss function to keep the model stable
and to keep it lightweight, while still enforcing
this kind of a penalty. What you saw earlier
were the results of the Toxicity V1 model. And as we made changes,
especially in terms of creating manual
synthetic examples and augmenting the
data, we were able to see
real improvements. This is the toxicity
V6 model, where you can see that the
colors get lighter as the performance for
individual identity groups gets better.

We're really excited about the
progress that we've made here. But we know that there is
still a long way to go. The results you see here are on
synthetic data, short identity statements like I
talked about earlier. But the story of bias can
become much more complex when you're talking about
real data, comments that are actually used in the wild.

We're currently working
on evaluating our systems on real comments, building
up these datasets, and then trying to enhance our
understanding of performance and improvements in that space. While we've still seen
progress on real comments and improvements
from our changes, we know that this will actually
help more once we start looking at these real datasets. And actually, there's
a Kaggle competition live now if you're interested
in checking this out more. Overall, the biggest lesson is
"Test early and test often." Measuring your systems
is critical to actually understanding where
the problems exist, where our users
might be facing risk, or where our products
aren't working the way that we intend them to.

Also, bias can affect
the user experience and cause issues in
many different forms. So it's important
to develop methods for measuring the
scale of each problem. Even a particular single
product may manifest bias in different ways. So we want to actually be sure
to measure those metrics, also. The other thing to
note is it's not always quantitative metrics. Qualitative metrics,
user research, and adversarial
testing, really stress-testing
and poking at your product manually, can also be
really, really valuable. Lastly, it is possible
to take proactive steps in modeling that are aware of
your production constraints.

These techniques
have been invaluable in our own internal use cases. And we will continue to
publish these methods for you to use, as well. You can actually go online to learn more. I also want to
talk about design. And this is our third
lesson for today. Because context is
really important. The way that our users interact
with our results is different. And our design decisions around
the results have consequences. Because the experience that
a user actually has with a product extends beyond the
performance of the model. It relates to how
users are actually engaging with the results. What are they seeing? What kind of information
are they being given? What kind of information do
they have that maybe the model may not have? Let's look at an example. Here you see an example from
the Google Translate product. And what you see
here is a translation from Turkish to English. Turkish is a
gender-neutral language, which means that in Turkish,
nouns aren't gendered. And "he," "she," or "it" are all
referenced through the pronoun, "O." I actually misspoke.

I believe not all nouns are
gendered, but some may be. Thus, while the sentences
in Turkish, in this case, don't actually specify
gender, our product translates it to
common stereotypes. "She is a nurse,"
while "He is a doctor." So why does that happen? Well, Google Translate learns
from hundreds of millions of already translated
examples from the web. And it therefore also learns
the historical and social trends that have come with these
hundreds of millions of examples, the
historical trends of how we've thought of occupations
in society thus far.

So it skews
masculine for doctor, whereas it skews
feminine for nurse. As we started to look
into this problem, we went back to those
first two lessons. OK, how can we make the
training data more diverse? How can we make it
more representative of the full gender diversity? Also, how could we
better train a model? How could we improve
and measure the space, and then make modeling changes? Both of these questions
are important. But what we started to
realize is how important context was in this situation.

Take, for example, the
sentence, "Casey is my friend." Let's say we want to translate
to Spanish, in which case friend could be "amigo," the
masculine version, or "amiga," the feminine version. Well, how do we know if Casey
is a male, a female, or a gender non-binary friend? We don't have that context. Even a perfectly
precise model trained on diverse data that represents
all kinds of professions would not have that context.

And so we realized
that even if we do make our understandings
of terms more neutral, and even if we were to
build up model precision, we would actually want to give
this choice to the user, who actually understands
what they were trying to achieve with the
sentence in the translation. What we did is choose to
provide that to our users in the form of options
and selections.

We translate "friend"
both to "amigo" and to "amiga," so
that the user can make a choice that is
informed based on the context that they have. Currently, this solution is only
available for a few languages. And it's also only available
for single terms like "friend." But we're actively working
on trying to expand it to more languages,
and also trying to be inclusive of larger
sentences and longer contexts, so we can actually tackle
the example you saw earlier.

We're excited about this
line of thinking, though, because it enables us to think
about fairness beyond simply the data and the
model, but actually as a holistic experience that a
user engages with every day, and trying to make
sure that we actually build those communication lines
between the product and the end consumer. The biggest lesson we learned
here is that context is key. Think about the
ways that your user will be interacting with your
product and the information that they may have that
the model doesn't have, or the information that
the model might have that the user doesn't have.

How do you enable the users
to communicate effectively with your product, but also
get back the right transparency from it? Sometimes, this is about
providing user options, like you saw with Translate. Sometimes, it's also just
about providing more context about the model's decisions,
and being a little bit more explainable and interpretable. The other piece that's
important is making sure that you get feedback
from diverse users. In this case, this was users
who spoke different languages, and who had different
definitions of identity.

But it's also
important to make sure, as you're trying to get
feedback from users, that you think about
the different ways in which these users
provide you feedback. Not every user is equally
likely to be accepting of the same feedback
mechanism, or equally likely to proactively give you
feedback in, say, a feedback form on your product.

So it's important to
actually make sure that whether that be
through user research, or through dog fooding, or
through different feedback mechanisms in your
product, that you identify different ways to access
different communities who might be more or less likely
to provide that information. Lastly, identify ways to
enable multiple experiences in your product. Identify the places where
there could be more than one correct answer, for example. And find ways to enable users to
have that different experience. Representing human culture
and all of its differences requires more than a theoretical
and technical toolkit. It requires a much more rich and
context-dependent experience. And that is really, at
the end of the day, what we want to provide our users. We hope that those
lessons were helpful. They've been lessons that we've
been really, really grateful to learn, and that we've started
to execute in our own products. But what's next? We're starting to put these
lessons into practice. And while we know that product
development in ML fairness is a context-dependent
experience, we do want to start building
some of the fundamentals in terms of tools, resources,
and best practices.

Because we know how important
it is to at least start with those metrics,
start with the ability to collect diverse data, start
with consistent communication. One of the first things
we're thinking about is transparency frameworks. We want to create and
leverage frameworks that drive consistent communication–
both within Google, but also with the
industry at large– about fairness and
other risks that might exist with data
collection and modeling. We also want to build
tools and techniques, develop and socialize tools that
enable evaluating and improving fairness concerns. Let's talk about
transparency first. Today, we're committing to
a framework for transparency that ensures that we think
about, measure, and communicate about our models and data
in a way that is consistent. This is not about achieving
perfection in our data on models, although of
course we hope to get there.

It's about the context
under which something is supposed to be used. What are its intended use cases? What is it not intended for? And how does it perform
across various users? We released our first
Data Card last October as part of the Open
Images Extended Dataset that you heard Jackie
talk about earlier. This Data Card allows us
to answer questions like, what are the intended use
cases of this dataset? What is the nature
of the content? What data was excluded, if any? Who collected the data? It also allows us
to go into some of the fairness considerations.

Who labeled the data, and what
information did they have? How was the data sourced? And what is the
distribution of it? For Open Images
Extended, for example, while you can see that the
geographic distribution is extremely diverse, 80%
of the data comes from India. This is an important
finding for anyone who wants to use this
dataset, both for training or for testing purposes. It might inform how you
interpret your results. It also might inform
whether or not you choose to augment your
dataset with something else, for example. This kind of transparency
allows for open communication about what the actual use cases
of this dataset should be, and where it may have flaws.
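
A Data Card statistic like that geographic distribution is straightforward to compute and surface. The toy records below are made up; the point is that a heavy skew, like the 80%-from-India finding, should jump out of the numbers for anyone deciding how to use the dataset.

```python
from collections import Counter

# Toy 10-image dataset: 8 images from India, 1 each from Brazil and Nigeria.
records = [{"country": c} for c in ["IN"] * 8 + ["BR", "NG"]]

def country_distribution(rows):
    """Fraction of examples per country, for a Data Card's distribution table."""
    counts = Counter(r["country"] for r in rows)
    total = sum(counts.values())
    return {country: n / total for country, n in counts.items()}

dist = country_distribution(records)
print(dist["IN"])  # 0.8 -> a skew worth documenting in the card
```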

We want to take this a step
further with Model Cards. Here you see an
example screenshot for the Jigsaw
Perspective Toxicity API that we talked about earlier. With Model Cards, we want
to be able to give you an overview of what
the model is about, what metrics we use
to think about it, how it was architected, how it
was trained, how it was tested, what we think it
should be used for, and where we believe
that it has limitations. We hope that the
Model Card framework will work across models,
so not just for something like toxicity, but also
for a face detection model, or for any other use case
that we can think of. In each case, the framework
should be consistent. We can look at metrics. We can look at use cases. We can look at the
training and test data. And we can look at
the limitations. Each Model Card will also
have the quantitative metrics that tell you how it performs.
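
As a sketch, a Model Card can be represented as structured data: intended use, limitations, and metrics sliced by group. The field names and values below are illustrative assumptions, not an official schema.

```python
# A minimal Model Card as a plain data structure. Values are made up.
model_card = {
    "model_name": "toxicity-classifier-demo",
    "intended_use": "Flag likely-toxic comments for human review.",
    "not_intended_for": ["Fully automated moderation without review"],
    "limitations": ["Trained mostly on English web comments"],
    "metrics_by_slice": {
        "all_ages": {"accuracy": 0.91},
        "child":    {"accuracy": 0.88},
        "adult":    {"accuracy": 0.93},
        "senior":   {"accuracy": 0.86},
    },
}

# A consistent framework makes gaps easy to spot, e.g. the worst slice:
worst = min(model_card["metrics_by_slice"].items(),
            key=lambda kv: kv[1]["accuracy"])
print(worst[0])  # "senior"
```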

Here, for example, you
can see an example set of metrics sliced by age. You can see the
performance on all ages, on the child age bucket,
on the adult age bucket, and on the senior age bucket. So how do you create
those metrics? How do you compute them? Well, we also want to be
able to provide you the tools to do this analysis, to be able
to create your own model cards, and also to be able to
improve your models over time.

The first piece of the
set of tools and resources is open datasets. The Open Images Extended
Dataset is one of many datasets that we have and hope
to continue to open source in the coming years. In this example, the Open
Images Extended Dataset collects data from
crowdsourced users who are taking images of
objects in their own regions of the world. You can see, for example,
how a hospital or food might look different in
different places, and how important it is
for us to have that data.

With the live
Kaggle competition, we also have open
sourced a dataset related to the
Perspective Toxicity API. I mentioned earlier
how important it is for us to look at
real comments and real data. So here, the Jigsaw
team has open sourced a dataset of real
comments from around the web. Each of these comments is
annotated with the identity that the comment references,
as well as whether or not the comment is toxic,
as well as other factors about the comment, as well. We hope that datasets
like these continue to be able to advance the
conversation, the evaluation, and the improvements
of fairness. Once you have a dataset,
the question becomes, how do you take
that step further? How do you evaluate the model? One thing you can do today
is deep-dive with the What-If tool.

The What-If tool is available
as a Tensorboard plugin, as well as a Jupyter Notebook. You can deep-dive into
specific examples, and see how changing features
actually affects your outcome. You can understand different
fairness definitions, and how modifying the
threshold of your model might actually change the
goals that you're achieving. Here's a screenshot
of the What-If tool.
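The edit-a-feature workflow described above can be illustrated with a toy stand-in model. This is not the What-If tool's API; `toy_classifier` and its feature names are invented for illustration:

```python
def toy_classifier(features):
    """A stand-in binary classifier scoring an example from two features.

    Purely illustrative -- not the What-If tool or a real model.
    """
    score = 0.02 * features["age"] + 0.5 * features["income_bracket"]
    return 1 if score >= 1.0 else 0

example = {"age": 10, "income_bracket": 1}
original = toy_classifier(example)

# Perturb one feature, holding the rest fixed, and compare outcomes --
# the same counterfactual probe the What-If tool lets you do by hand.
perturbed = dict(example, age=60)
flipped = toy_classifier(perturbed)

print(original, flipped)  # prints "0 1": flipping age changed the decision
```

When a single-feature edit flips the prediction, that feature is clearly influencing the model for this example, and it may be worth deep-diving on that slice.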

What you see here
is a whole bunch of data points that are classified
by your model. Data points of a
similar color have been given a similar score. You can select a
particular data point, and then with the
features on the right, you can actually modify
the feature value to see how changing the
input would potentially change the output. For example, if I change the
age defined in this example, does it actually change
my classification? If it does, that might
tell me something about how age is
influencing my model, and where potentially,
there may be biases, or where I need to
deep-dive a little bit more. We also hope to take
this a step further with Fairness
Indicators, which will be launched later this year. Fairness Indicators
will be a tool that is built on top of
TensorFlow Model Analysis, and as a result, can work end
to end with the TFX pipeline.

TFX stands for
TensorFlow Extended. And it's a platform that
allows you to train, evaluate, and serve your
models, all in one go. And so we're hoping to build
fairness into this workflow and into these processes. But Fairness Indicators
will also work alone. It'll work as an
independent tool that can be used with
any production pipeline. We hope that with
Fairness Indicators, you'll be able to actually
look at data on a large scale, and see actually how
your model performs. You can compute fairness metrics
for any individual group, and visualize these comparisons
to a baseline slice. Here, for example, you
can see the baseline slice as the overall average
metric in blue, and then you can
actually compare how individual groups
or individual slices compare to that baseline. For example, some may have
a higher false negative rate than average, while
others may have a lower one. We'll provide guidance
on the main metrics
that we believe have been useful
for various fairness use cases.
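The slice-versus-baseline comparison can be sketched in plain Python. The groups and label/prediction pairs below are made up; Fairness Indicators computes comparisons like this for you at scale:

```python
def false_negative_rate(pairs):
    """FNR = positives predicted negative / total positives."""
    positives = [(y, p) for y, p in pairs if y == 1]
    if not positives:
        return 0.0
    misses = sum(1 for _, p in positives if p == 0)
    return misses / len(positives)

# (true_label, predicted_label) pairs per hypothetical group.
slices = {
    "group_a": [(1, 1), (1, 0), (0, 0)],
    "group_b": [(1, 1), (1, 1), (0, 1)],
}

# Baseline: the metric over all examples pooled together.
baseline = false_negative_rate(
    [pair for pairs in slices.values() for pair in pairs])

for name, pairs in slices.items():
    fnr = false_negative_rate(pairs)
    direction = "above" if fnr > baseline else "at or below"
    print(f"{name}: FNR={fnr:.2f} ({direction} baseline {baseline:.2f})")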

You can then use
Fairness Indicators also to evaluate at
multiple thresholds to understand how
performance changes, and how maybe
changes to your model could actually lead to different
outcomes for different users. If you find a slice
that doesn't seem to be performing as well
as you expect it to, you can actually take
that slice further by deep-diving immediately
with the What-If tool. We will also be providing
confidence intervals, so that you can understand where
the differences that you're seeing are significant,
and where we may actually need more data to better
understand the problem. With Fairness
Indicators, we'll also be launching case
studies for how we've leveraged these
metrics and improvements in the past internally
in our own products. We hope that this will
help provide context about where we found
certain metrics useful, what kinds of insights
they've provided us, and where we found that certain
metrics actually haven't really served the full purpose. We'll also provide
benchmark datasets that can be immediately used
for vision and text use cases. We hope that Fairness
Indicators will simply be a start to being able to
ask questions of our models, understand fairness concerns,
and then eventually, over time, improve them.

Our commitment to you
is that we continue to measure, improve,
and share our learnings related to fairness. It is important not only
that we make our own products work for all users,
but that we continue to share these best practices
and learnings so that we, as an industry, can continue
to develop fairer products– products that work
equitably for everybody. One thing I do
want to underscore is that we do know
that in order to create diverse products, products
that work for diverse users, it is also important to have
diverse voices in the room.

This not only means
making sure that we have diverse voices internally
working on our products, but also means
that we include you as the community
in this process. We want your feedback
on our products, but we also want
to learn from you about how you're tackling
fairness and inclusion in your own work, what
lessons you're learning, what resources you're
finding useful. And we want to work with you to
continue to build and develop this resource toolkit,
so that we can continue, as an industry, to
build products that are inclusive for everyone. Thank you. [MUSIC PLAYING]
