Exploring AI with Emily M. Bender
Digital Citizen
Join us on a journey to learn more about the intersection of linguistics and AI with special guest Emily M. Bender. Come with us as we learn how linguistics functions in modern language models like ChatGPT.
Episode Notes
Discover the origins of language models, the negative implications of sourcing data to train these technologies, and the value of authenticity.
▶️ Guest Interview - Emily M. Bender
- Learn more about Emily M. Bender
- Read On the Dangers of Stochastic Parrots (2021) by Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Margaret Mitchell.
- Check out the publications by cognitive scientist Abeba Birhane.
- See work from AI research scientist Meg Mitchell.
🗣️ Discussion Points
- Emily M. Bender is a Professor of Linguistics at the University of Washington. Her work focuses on grammar, engineering, and the societal impacts of language technology. She’s spoken and written about what it means to make informed decisions about AI and large language models such as ChatGPT.
- Artificial Intelligence (AI) is a marketing term coined in the 1950s by John McCarthy to attract funding; it refers to an area of study within computer science. Technology that lets us interface with computers using natural language is built on natural language processing and on linguistics, the science of how language works. Understanding how language works is necessary to comprehend large language models’ limitations and potential for misuse.
- Language model is the term for a type of technology designed to model the distribution of word forms in text. While early language models simply captured the relative frequency of words in a text, today’s language models are far larger in terms of the parameters they store and the amount of text they are trained on. As a society, we must continue reminding ourselves that synthetic text is not a credible information source. Before sharing information, it’s smart to verify that it was written by a human rather than a machine. Valuing authenticity and citing sources are some of the most important things we can do.
- Distributional biases in the training data of large language models are reproduced in their output. The less care we put into curating training data, the more various patterns and systems of oppression will be reproduced, regardless of whether the end result is presented as fact or fiction.
- People are being falsely accused of using ChatGPT to produce professional and academic work. Part of the problem is that there is currently no watermarking at the source. There is a major need for regulation and accountability around synthetic text nationally.
- Being a good digital citizen means avoiding using products built on data theft and labor exploitation. On an individual level, we should insist on transparency regarding synthetic media. We can also continue to increase the value of authenticity.
🔵 Find Us
- Digital Citizen Website: fastmail.com/digitalcitizen.
- Check out our blog.
- Tweet us @Fastmail.
- Follow us on Mastodon: @fastmail@mastodon.social.
💙 Review Us
If you love this show, please leave us a review on Apple Podcasts or wherever you listen to podcasts. Take our survey to tell us what you think at digitalcitizenshow.com/survey.
Ricardo Signes: Welcome back to the Digital Citizen podcast. I’m Ricardo Signes at Fastmail, the email provider of choice for savvy digital citizens everywhere. Here with me is my colleague, Haley Hnatuk. Haley, let them know who you are.
Haley Hnatuk: Hi, everyone. I’m Haley Hnatuk, Fastmail’s Senior Podcast Producer, Marketing Specialist, and the Co-host of Digital Citizen. Who are you going to be chatting with today, Rik?
Ricardo Signes: Today I’ll be talking with Emily Bender. She’s a Professor of Linguistics at the University of Washington, who works on things like grammar, engineering, and the societal impacts of language technology. Recently, she’s been doing a whole lot of public scholarship, combating AI hype, which is the core of what we’re going to talk about today. We’re also going to talk about the intersection of linguistics and AI, and whether people make informed decisions about how to engage with large language models.
Haley Hnatuk: Well, Rik, can you tell me a little bit about your early exposure to AI?
Ricardo Signes: Look, I’d been following along with the GPT language models for a while before the launch of ChatGPT. And it seemed neat. I’ve got a book on my shelf somewhere called Transformer Poetry: Classic Poetry Reimagined by Artificial Intelligence by Kane Hsieh, which was someone feeding parts of poems into the language model and getting new second halves out, which I thought was fun. And when ChatGPT launched, I thought it was sort of a delightful way to get nonsense out. And for me, I sort of soured on it when I started to see people using it to produce serious work that they could submit or use as if it was writing produced by a human who’d carefully thought through it all. What about you?
Haley Hnatuk: Yeah, I think my earliest exposure to ChatGPT was in poetry class. We used it to give us some prompts of things to write and it sent us these really surreal, strange prompts, like, write a poem about if you were a Pillsbury Doughboy in the oven, and other things like that. And since then, I really haven’t spent a lot of time interfacing with ChatGPT at all. But yeah, that was my first exposure to it.
Ricardo Signes: Well, we’re gonna hear more about things people do or should do or shouldn’t do with ChatGPT and other large language models in our conversation with Emily Bender. And then at the end, we will have some takeaways, things you can actually do to become a better digital citizen. You can also find those at our website at fastmail.com/digitalcitizen.
Ricardo Signes: So, today, I am here with Emily Bender. When I first heard about your work, it was about ChatGPT, the chatbot from OpenAI that was released, I guess, late last year. What does linguistics have to do with artificial intelligence?
Emily M. Bender: Yeah. So, artificial intelligence, first of all, isn’t really a thing, but that word is used as a marketing term and also to refer to an area of study within computer science, where it’s also a marketing term. It was developed in the 1950s by John McCarthy to attract funding. But if you think about trying to build technology where you can interface with computers using natural language, then you’re talking about natural language processing, and then linguistics, which is basically the science of how language works, is a really important component. So that’s one thing it has to do with it.
Emily M. Bender: The other thing is understanding how language works is really key to understanding the sleight of hand and other sort of quackery that’s going on with large language models. And so that’s part of why I’m using my expertise in linguistics to speak out in this way.
Ricardo Signes: So right, you said language model and that’s the phrase we all started seeing when ChatGPT came out and said, “Well actually I’m a language model.” What is a language model, and especially a large one? Because it always says it’s a large language model.
Emily M. Bender: Yeah. Okay. So, language model is the name for a kind of technology that is designed to model the distribution of word forms in text. And the really old language models were called n-gram language models. You’d grab some collection of texts, called a corpus, and just count up the words. And so, a unigram language model is the relative frequency of all the different kinds of words in the text. And if you think about old spellcheck, where first you just got the squiggle, and then you got some possible things it might’ve been, and then you got the possible things it might’ve been ranked by likelihood, that’s a language model at work. So, the very first ones are just like, okay, well, which words are more likely than other words. You could do a bigram language model: given one previous word, what’s the relative likelihood of all the other words, and so on. What we have now, things like the generative pre-trained transformer, are bigger in terms of the amount of data they’re storing and the amount of data they’re trained on. So, the whole thing is just designed around modeling the distribution of word forms in text. They’re called large language models because they have both very large numbers of parameters, that is, the weights that they’re using to store the information, and by information, again, I just mean information about the distribution of word forms in text, and also absolutely enormous training data. So, just large swaths of text taken from wherever it can be grabbed. Now, it’s not the whole internet. And this is something I hear a lot: people say, “It was trained on the whole internet.” That’s not a thing. The internet is not something that you can go get somewhere.
Ricardo Signes: Download a copy, yeah.
Emily M. Bender: Right, exactly. But it is trained on a lot of texts that is indiscriminately collected.
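To make the n-gram idea above concrete, here is a minimal, editor-supplied Python sketch (the toy corpus and function names are invented for illustration; this is not any system discussed in the episode). It counts which words follow which words, turns the counts into relative frequencies, and then, like hitting the middle autocomplete suggestion over and over in the phone game mentioned later in the conversation, generates text by always taking the most frequent next word.

```python
# Toy bigram language model: count word-pair frequencies in a tiny corpus,
# then "autocomplete" by always picking the most frequent next word.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the dog sat on the rug".split()

# Count how often each word follows each preceding word.
bigram_counts = defaultdict(Counter)
for prev, cur in zip(corpus, corpus[1:]):
    bigram_counts[prev][cur] += 1

def next_word_probabilities(prev: str) -> dict:
    """Relative frequency of each word given the single previous word."""
    counts = bigram_counts[prev]
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def autocomplete(start: str, length: int = 8) -> list:
    """Repeatedly take the most frequent next word, like the old phone game."""
    words = [start]
    for _ in range(length):
        counts = bigram_counts[words[-1]]
        if not counts:
            break
        words.append(counts.most_common(1)[0][0])
    return words

print(next_word_probabilities("the"))  # {'cat': 0.25, 'mat': 0.25, 'dog': 0.25, 'rug': 0.25}
print(autocomplete("the"))  # the toy model quickly falls into a repetitive loop
```

With so little data the output loops almost immediately; large language models do the same kind of distribution modeling with vastly more parameters and training text, which is why their output reads as fluent even though it is still only predicted word forms.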
Ricardo Signes: All right. I’m going to come back to that. Before we talk about what the computers do with the language, for humans, for you and me, what’s the function of language? What are we using it for?
Emily M. Bender: Well, we use language for a lot of things. I think its primary function is face-to-face communication, and that itself is something that’s important for building relationships, for building our sense of self, for building societies, for building systems of rules, for playing games. Language is just inherent in almost everything we do, or it can be added on, right? I just went for a jog; that doesn’t require language, but I was listening to a podcast while I did it. That does.
Ricardo Signes: A lot of that answer was really straightforward and clear to me. I’m curious to hear you say more about the… having the people in the room whose data was brought into the training material. What are the ramifications of that? What do you have in mind?
Emily M. Bender: Yeah. So, I’m thinking in particular of what happened with the Crisis Text Line, where the texts that people sent in moments of crisis to the Crisis Text Line were repurposed as training data, I believe for trying to make virtual sales assistants more empathetic.
Ricardo Signes: Wow.
Emily M. Bender: Which is just hugely unethical. And we are in a moment right now where somehow, as a society, we have gotten excited about building things out of big data, and so we’ve given a pass to the big data collectors to amass these data sets that, if you take a step back, you think, no way is that actually an appropriate thing to do. And part of the problem is that prior to digitized everything, much of our lives was ephemeral, right? If you meet somebody on the corner and have a conversation, you might both remember that, and somebody might’ve overheard it. But for that conversation to persist, it has to be someone talking about their memory of it. If you have a conversation that’s being recorded in any form, including texting, then it’s there as data that can actually be revisited and collected and used. And I don’t think our regulatory system has really caught up to that.
Ricardo Signes: Yeah, I’ve got more questions about the intake of this data, but going back to use cases, I think the use case that seems to be the one showing up most in my face right now is the presentation of these text generators as information desks where I show up and I say, “Here’s my question and I’m given an answer.” What’s a metaphor, if any, that can help people understand what’s happening? What’s— How can we show people what the system is really doing?
Emily M. Bender: One of my favorite ones is actually the Magic 8 Ball, if you remember that toy.
Ricardo Signes: Okay, for sure.
Emily M. Bender: So, you shake this plastic toy and there’s a die inside that’s got various answers written on it. And so, you pose a question like, “Is it going to rain today?” And it might say something like, “Unclear. Ask again later,” which is a response that works for any question, or something like “Signs point to yes,” which only works if you’ve asked a yes/no question. And so, when you’re playing with this toy, you sort of quickly get into the habit of asking yes/no questions so that you don’t get incoherent dialogues like, “What should I have for lunch?” “Signs point to yes,” which doesn’t work. And I think it’s helpful to think of ChatGPT and the like as similar. We are sending in questions so that we can make sense of the output, but the output isn’t answering our question. We might as well be asking the Magic 8 Ball.
Ricardo Signes: Right. But it really feels like it is.
Emily M. Bender: Yeah, it really feels like it is. I’ve got one more metaphor for you in case this helps. Do you remember playing the game on early smartphones where you would, say, start typing a text message, “I’m sorry I’m late, I…”, and then just hit the middle suggestion over and over again?
Ricardo Signes: Yes.
Emily M. Bender: Did you ever play that? Yeah. And what you got on your phone was different from what someone else playing the same game got on their phone, because there’s some local data stored about your text frequency, right?
Ricardo Signes: Right.
Emily M. Bender: And we understood that as reflecting how we used our own phones. And so, it was funny. That’s what ChatGPT is doing, except that it’s not just our data, it’s whatever data they were able to collect off of the internet, plus some crowdsourced have-people-write-dialogues stuff going on. So, not a source of information, but really compelling, like you said. And that’s because it is so fluent. It is designed to thoroughly mimic the use of language in a certain register, right? And you can say, you know, please talk like a teenager at the mall, and then maybe get it to move a little bit, but its default is going to be the sort of authoritative, friendly information source, because it was built that way. And prior to this, for the most part, if we ever encountered well-formed text in the wild, it’s because some person had written it.
Ricardo Signes: Right.
Emily M. Bender: That’s not true anymore. The other part of it is that the way we understand language, in all of our uses of language, is not by sort of directly decoding the signal in the words. It might seem like it, but that’s not what’s going on. What’s going on is we are collectively building a common ground between us, or if it’s asynchronous, we’re imagining a common ground, and then saying, “All right, given that common ground, given what I know about that other person, what were they likely trying to communicate by picking those words?” In other words, we are imagining a mind behind the text in order to understand it, and we apply those same skills when we see the output of ChatGPT, which is hard. And on top of that, OpenAI is leaning into it. They’re not guarding against our tendency to anthropomorphize. For example, there’s absolutely no reason the chatbot should use first-person pronouns. It shouldn’t say, “I’m a language model.”
Ricardo Signes: Right.
Emily M. Bender: It should say, “This is a language model.”
Ricardo Signes: I’m just processing all that. There’s a lot going on that makes me feel uneasy. There are bad metaphors in place that push us toward misunderstanding. There are cognitive biases where it’s like, this thing seems to speak English, so there’s probably a brain in there. And as you say, it feels like there’s sort of a coercive element there, and we’re leaning on that to get an outcome that the producer wants. With all that in place, what can we, everybody who has to go deal with the internet, do to make informed decisions about how, and it sounds like maybe whether, to engage with these language model systems at all?
Emily M. Bender: Yeah, I mean, I don’t. I make a point of not reading synthetic text, because why would I spend my time that way? I understand that people had some fun playing with it, but partially because of the public scholarship that I do, I got bombarded with people suggesting that I read this, that, and the other output, especially in the early days. It’s like, “No, I have better things to do with my time.” I think it’s really important to keep reminding ourselves that synthetic text is not an information source, that the only thing ChatGPT has information about is the distribution of words in text, and holding that in mind is hard, especially when it seems so plausible. But that’s one thing.
Emily M. Bender: And then, another thing is to really work towards transparency. So, to say, you know, we have a right to know when we’ve encountered synthetic text. That right is not being upheld at the moment, but we can insist on it and we can individually work towards it, right? So, if you were going to share some information, you can investigate its provenance: where did this come from? And then, share that metadata along with the information. So, sort of overtly valuing authenticity, I think, is really important.
Ricardo Signes: On the subject of citation, beyond just me as a person citing the non-synthetic source I gave to you, an argument I’ve seen a number of times is people saying, “Well, the generated text will be more trustworthy if the language model can cite its sources.” Putting aside what you’ve said, which is basically that it’s looking at statistics, it’s not a researcher. Can it do that? If we look at the model, is it built in such a way that we can inspect the answer?
Emily M. Bender: No, absolutely not. You can build it to output sources, but that’s not the same thing as citing its sources. That’s, “Okay. What’s a likely string to come here?” And there’s all these stories of ChatGPT making up citations, and then people going to the library and saying, “I need to find this work. ChatGPT said it’s a thing,” and the library is going, “Actually, it was just made up.” Then, of course, there’s the whole story of the lawyer who got in trouble by asking ChatGPT for some precedent case law, and it happily output something. It looks like precedent case law.
Emily M. Bender: It’s not a big database of information about the world or about things people have said about the world. It’s a big database, or collection of information, about which words tend to co-occur with which other words in its training data.
Ricardo Signes: So, now, we’ve got this idea that someone, I won’t say we, because I hope you and I have not done this, someone has built a so-called information desk where we can go and pose a question and be given an answer. And we’ve established here that there’s no straight line from the raw material to a reason to believe that answer. But people are in a position to take it as the truth. And this corpus wasn’t built from the whole internet; it’s some large swath of indiscriminately gathered data. What are the dangers to the consumers of this data, based on where the corpus came from?
Emily M. Bender: One of the things that happens in representing the distribution of words in text is that you are going to get reflections of various kinds of social biases. So, one example that I associate with Meg Mitchell is that the phrase “woman doctor” is actually far more frequent in English than the phrase “man doctor,” because even though the medical profession is actually not so gender-skewed at this point, it is still treated in our discourse as an unusual thing for a doctor to be a woman.
Emily M. Bender: And conversely, no one says a “woman nurse” or a “female nurse,” but you definitely hear “man nurse” or “male nurse,” because of perceptions of how those professions are populated. And that’s a pretty mild example, but those kinds of distributional biases are in there, and you’re going to then get them in the output. And even if you’re using that output to create fiction, where you’re not taking it as true claims about the world, like which events happened or didn’t happen, we don’t expect facts in fictional stories, but we do expect truths about the world, right? And so, if we’re generating stuff that we understand to be entertainment, that we understand to be fiction, but it is reproducing various patterns and systems of oppression, that’s bad. And the less care we put into curating training data, the more of that there’s going to be.
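As a concrete illustration of the distributional point above, here is a tiny, editor-supplied Python sketch (the sentences and the resulting counts are invented, not real corpus statistics): it counts how often gendered modifiers appear in front of profession words, the kind of asymmetry a model trained on such text would reproduce.

```python
# Count gendered modifiers directly preceding profession words in toy text.
from collections import Counter

# Hypothetical snippets standing in for web-scraped training data.
documents = [
    "the woman doctor saw the patient",
    "the doctor saw the patient today",
    "the male nurse checked the chart",
    "the nurse checked the chart twice",
]

GENDER_MARKERS = {"woman", "man", "female", "male"}
PROFESSIONS = {"doctor", "nurse"}

phrase_counts = Counter()
for doc in documents:
    words = doc.split()
    for prev, cur in zip(words, words[1:]):
        if prev in GENDER_MARKERS and cur in PROFESSIONS:
            phrase_counts[f"{prev} {cur}"] += 1

# In this toy data, "doctor" only gets a gender marker when the doctor is a
# woman, and "nurse" only when the nurse is a man; a model fit to these counts
# inherits exactly that skew.
print(phrase_counts)  # Counter({'woman doctor': 1, 'male nurse': 1})
```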
Ricardo Signes: What does curating the data look like? Do we enforce reverse biases? Can we eliminate them? Can we choose the biases we want, which feels like maybe a manipulative problem?
Emily M. Bender: So, first of all, there’s no such thing as an unbiased dataset. And secondly, picking without thinking about that is not the neutral stance, right? There’s no neutral stance here. The way to curate a dataset is to think about, "Okay, what’s the purpose that I’m building this for?” And then what kind of data would be appropriate? Who are the people who need to be in the room? Whose viewpoints are going to be affected here? Viewpoints isn’t the right word there. Whose lives are going to be impacted by the use of this technology and what does that mean about how we want to design our dataset and then test it for various biases?
Emily M. Bender: Abeba Birhane points out that machine learning as a practice is inherently conservative in the sense that it is an implementation of taking patterns from the past and using them to affect the future.
Ricardo Signes: Right.
Emily M. Bender: So, if we are going to be using pattern recognition or synthetic media to do things in the future, we should do it in terms of the future we want to build and not in terms of just like, “Well this is how the world is, so we’re just going to keep doing that.”
Ricardo Signes: Well, and I think that there’s a next-order problem, which is a thing that’s been on my mind a lot: there are these models that are taking in large amounts of search results and building a predictive text model to generate more text for us, and then we ask it questions and it gives us an answer with no citation, no paper trail back to where it came from. But meanwhile, it sure seems like people are producing a lot of synthetic text to publish on their blogs, on sources of information, which I would assume are now being re-consumed into the next order of text generation. And at some point, it seems like there’s a vicious cycle there, one I would expect to produce a blurring of information. Is there a solution, something to prevent that?
Emily M. Bender: So, I think the solution would require watermarking at the source so that we can definitively filter, or to a high degree of accuracy, filter the stuff that’s come from the synthetic media machines. And, yeah, it’s all over the place. I don’t know if they’ve been taken down or not, but for a little while at least there were synthetic text mushroom foraging guides that were being published on Amazon as if they were actual books.
Ricardo Signes: That sounds dangerous.
Emily M. Bender: Yes, very dangerous. And oh, there was something like, I don’t know if this is still there either, but if you searched on Google, “name a country in Africa that starts with K,” what came back was this absurd statement: “There are no countries in Africa that start with K.” And if you clicked through, that was a ChatGPT output that someone had posted on some site somewhere, and it somehow became the first Google hit. The way out of this is insisting on transparency: basically setting up a regulatory regime where you can always tell that you’ve hit synthetic media, and so therefore you can spot it and know it on a one-by-one basis, but also systematically filter it out. Just like you might have an ad blocker on your browser, you could have a synthetic media blocker.
Ricardo Signes: That sounds great. Knowing what came from a person and what came from a random number generator sounds good. But I’ve read, I hesitate to say articles. I’ve read stuff on the internet where I’ve seen people discussing the idea that students are now passing in work where they’ve said, “Please write my term paper.” And the countermeasure is that the professors will take the work and pass it to a synthetic text detector. How does that work?
Emily M. Bender: Not well. Those synthetic text detectors are not accurate, and it’s not a good pedagogical practice to be policing students in that way, I don’t think. The most famous story was a professor at Texas A&M who last spring suspected his students of using ChatGPT to write assignments. And so, he asked ChatGPT if it had written those assignments, and ChatGPT said yes, and he failed the students.
Ricardo Signes: Wow.
Emily M. Bender: Which really means he failed to understand the technology that he was working with. But there are repeated stories of people being falsely accused because some instructor put their homework into one of these machines and it said, “Yep, this one’s fake.” And then, the students really have no recourse. So, that’s not the way out of it. Part of what’s going on there is that there’s no watermarking at the source, right? If we had a definitive tag that things came along with… now, people are still going to try to remove watermarks, but there are some really interesting proposals. One of the authors is Tom Goldstein, and the paper was at ICML this year, for how to put in a much more difficult-to-remove watermark, as opposed to just some text at the beginning.
Emily M. Bender: It is possible to, I think, do more than is being done in terms of marking things at the source. There are always going to be bad actors who create these systems and don’t embed the watermarks, but that doesn’t mean that it should just be a free for all, right? If we think in terms of the pollution metaphor and pollution of the information ecosystem, having some oil companies doing oil spills doesn’t mean that other oil companies are off the hook and can just spill away, right?
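For readers curious about the watermarking proposal mentioned above, here is a minimal, editor-supplied toy sketch of the general “green list” idea from the ICML 2023 watermarking work co-authored by Tom Goldstein (Kirchenbauer et al.). It is an illustration of the concept under simplifying assumptions, not that paper’s implementation: the previous token deterministically selects a “green” half of a toy vocabulary, watermarked generation favors green tokens, and a detector checks whether green tokens occur far more often than chance.

```python
# Toy "green list" watermark: hash the previous token to pick a green subset of
# the vocabulary, then detect watermarked text by counting green-token hits.
import hashlib
import math

VOCAB = ["the", "a", "cat", "dog", "sat", "ran", "on", "under", "mat", "rug"]
GREEN_FRACTION = 0.5  # fraction of the vocabulary marked "green" at each step

def green_list(prev_token: str) -> set:
    """Deterministically choose the green subset of the vocabulary from the previous token."""
    def score(word: str) -> int:
        return int(hashlib.sha256(f"{prev_token}|{word}".encode()).hexdigest(), 16)
    ranked = sorted(VOCAB, key=score)
    return set(ranked[: int(len(VOCAB) * GREEN_FRACTION)])

def detection_z_score(tokens: list) -> float:
    """How far above chance is the number of green tokens? A large z-score suggests watermarked text."""
    n = len(tokens) - 1
    hits = sum(1 for prev, cur in zip(tokens, tokens[1:]) if cur in green_list(prev))
    expected = n * GREEN_FRACTION
    variance = n * GREEN_FRACTION * (1 - GREEN_FRACTION)
    return (hits - expected) / math.sqrt(variance)

# A passage whose generator always chose words from the previous word's green
# list will score well above zero; ordinary human text will hover near zero.
```

The point Bender makes still holds: a determined bad actor can build a generator that skips the watermark, but requiring well-behaved systems to embed one would let detectors and filters work far better than today’s unreliable synthetic text detectors.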
Ricardo Signes: Yeah. You wrote a paper with Timnit Gebru called On the Dangers of Stochastic Parrots, and that came out in 2021? That was before ChatGPT sort of blew up and made it feel like a watershed event for large language models on every website. And reading that paper, the sentiment seemed to me, “Hey everyone, we should be careful in what we’re doing.” And looking at the internet today, I feel like, “Well, the genie has been let out of the bottle.” And as you say, we can’t… that doesn’t mean we can do nothing, right? Because someone is spilling oil, that’s bad, everyone else should stop, but what are the things we can do?
Emily M. Bender: Yeah. So, I think we really do have to look to regulation. Another environmental example that I like to point to is the ozone layer. That was really bad in the 1980s. We had a huge problem with the hole in the ozone layer because of technology that had been developed and was being widely used, and we put in regulations, and we stopped, and we fixed that. Similarly, we had a lot of problems with lead pollution because of leaded gasoline. We put in regulation; it was ubiquitous, it was embedded in the infrastructure, and now it’s basically gone. So, we can do this. It takes a lot of collective political will, but we can do it. Yeah, it’s certainly not fun to be in the position of having written a paper, which we wrote in late 2020 and published in March of ’21, saying, “Hey guys, don’t do this. It’s a bad idea,” and then having it happen anyway.
Emily M. Bender: But I’m also not giving up. I think that we need regulation. We need accountability. So, one of my regulation, like, dream items would be what if a person or organization that sets up a synthetic media machine is actually accountable for what it says as if they had said it themselves? That would change things in a big hurry.
Ricardo Signes: Who are the people, who are the agencies in the world that would need to be on board with this kind of regulation?
Emily M. Bender: I think we’re talking about national governments and then possibly something at the supranational level. And I know there’s a lot of conversation now happening about regulation. Some of it is nonsense about existential risk. But there’s also the US FTC, which is doing great work. Lina Khan is amazing, saying, “Hey, you know, it’s our job to regulate the activities of businesses, and just because they’re using automation, so-called AI, to do those activities, it doesn’t take it out of our jurisdiction.” So, there’s a lot that can be done. The lawyer who queried ChatGPT for precedent case law and then submitted it got in trouble because he’s working in a well-regulated area. I think the question is where is there not yet enough regulation, and whose jurisdiction does it fall under? And that needs to be done, I think, country by country, and then probably also at an international level.
Ricardo Signes: There’s a lot of stuff on the internet that regulation tells us shouldn’t be there. There’s pirated material, there’s hate speech, there are instructions on how to commit crimes. Is that an indication that we’re always going to have synthetic text posing as real text, and that regulation isn’t going to stop it? And this is back to our idea that less is better than all. How do we think about this?
Emily M. Bender: I think this is not going to be something that we could ever get rid of 100%, but that’s one of those things where less is definitely better than all. You know, that’s most things in life. I’ve spent some time thinking about what would be a safe and beneficial use case of synthetic text, and that’s hard. So, it would have to be a case where either it’s really only about the form, you don’t care about the content or the content can be effectively and thoroughly vetted. It would have to be a case where you’re not concerned about plagiarism and setting up someone else’s ideas as your own. It would have to be a case where the kinds of biases that are going to come out either don’t matter somehow or again, can be thoroughly mitigated. And it would have to be a case where you don’t care about the data theft and labor exploitation that underlies the system.
Ricardo Signes: Right. Okay.
Emily M. Bender: Or you’ve found one of these where that’s not true. That really narrows things. So, some of my candidate cases, sort of assuming we have a version of this that was built without data theft and labor exploitation, where the other criteria you’d think might be satisfied, are things like a dialogue partner for a language learner or a non-playable character in a video game. But there, we’re again really concerned with biases.
Emily M. Bender: Even though, yes, I don’t think I’m getting real information there, there’s still going to be this like, “Oh, that’s how we talk about the world in this language.” Or, “That’s the world this video game creator has produced.” So, one of the better ones that I have come across is “tip of the tongue” searches where what you’re searching for is the name of the thing, and you can describe it, but you don’t know what it’s called.
Ricardo Signes: Yeah, okay, that sounds great.
Emily M. Bender: That’s really hard to do with a regular search engine. That’s an idea that came to me from Daniel Midgley. So, that seems like a good use case, partially because then you could immediately turn around and do a regular search on that term. Again, provided you’ve gotten an ethically produced one of these things. That’s super narrow, right? Oh, and we haven’t even talked about the environmental impact of building these. Data centers involve an enormous amount of power and water usage, and, you know, that’s going to involve a carbon footprint. And even if they’re running off of renewable energy, well, that’s renewable energy that couldn’t be used for something else, because we don’t yet have capacity for more renewable energy than we actually need. You put all of that in to get something that helps you with tip-of-the-tongue searches. Like, I’m just not convinced. Now, language models that are used for classification rather than text synthesis definitely have a role, right? This is part of automatic speech recognition. If you want good automatic transcription, a larger language model is going to help. You want good machine translation, it’s going to help.
Emily M. Bender: Summarization is kind of on the border, because the way these things are used for summarization can certainly introduce a bunch of artifacts that aren’t there in the thing you’re summarizing. But that kind of language technology is made better with better language models. I believe we have overshot the size of language model that is really helping with that, again, especially given environmental concerns, especially given the inability to collect the data carefully at these scales. But there are certainly positive use cases for language models and even larger language models, just not when they’re being used to create synthetic text.
Ricardo Signes: Well, that’s good. I’m glad they can do something good that we built them for. Okay, so the show that you are on right now is Digital Citizen, and we talk about digital citizenship. To us, that is how we engage with each other online and how we engage with a culture that exists online. And one of the central questions for us is what can we do, as individuals or as parts of that community, to be better digital citizens? And I wonder if you have advice on that topic?
Emily M. Bender: One thing is just being mindful of the fact that synthetic media is not information and that nobody has accountability for it, unfortunately. I’d love to fix that. And so, keeping that in mind as one goes about their day, you know, on the internet, I think can be really helpful. I would say it is an act of good citizenship to avoid using products that are built on data theft and labor exploitation. That is another kind of relationship that’s worth being careful about. I think we can, on an individual level, insist on transparency, moving around with the expectation that we should be able to know when we’ve encountered synthetic media, even if that’s not presently the case. I think that leads us to ask good questions.
Emily M. Bender: And then, as the counterpoint to that, we can again on an individual level, build up the value of authenticity and make a point of insisting on data provenance. And this again, not that new, right? If you think about the way misinformation flies around social media, one way to counter that is to not share something if it’s not citing its sources and similar habits, I think of what you might call information hygiene apply here as well.
Ricardo Signes: I am reminded of what you said about holding the text synthesizer creators responsible for the synthesized text. Maybe we can hold ourselves responsible for what we share and for understanding what’s going on there.
Emily M. Bender: Yeah, that’s a good practice. And I think to get out of this, we are going to need collective action and regulation and I really appreciate the efforts towards building information literacy. That’s a key component, but I don’t want it to be the only solution that we’re putting in place because you can’t “good behavior” your way out of these kinds of structural problems. Though certainly good behavior can help and it can help both in terms of it’s actually helping the larger problem and it helps keep one’s spirits up to have positive things that you can do.
Ricardo Signes: All right. I hope everybody enjoyed our final episode of the season.
Haley Hnatuk: I had a great time listening to your conversation with Emily. And I can’t wait to hear what our listeners think of this show. We’ve had a really, really great third season. And Rik, I’d love to hear about something that you learned.
Ricardo Signes: I think the big thing for me was that the conversation about large language models included a sort of sudden conversation about watermarking, and how you can make it possible to understand whether text was synthetically generated. That was very interesting to me, and it sort of led me down a series of rabbit holes trying to understand the exact practice behind that, which has been very interesting. On a more practical note, I think that talking about the right to repair reminded me how much time I used to spend doing computer repair, hardware repair, fiddling with my stuff, and how little I do that now. And it’s been a bit of a prompt to try and pay more attention to those kinds of problems. It hasn’t entirely paid off yet, but now I’m spending a little more time thinking about the fact that I can fix so many things that I otherwise might just replace. What about you?
Haley Hnatuk: Yeah, actually, after your interview with David from Morgen, I started implementing time blocking into both my personal and professional calendars, and it’s really changed the way that I get things done and the way that I think about my free time. I’m not quite at that place where if it’s not in my calendar, I’m not going to do it. I’m not quite using my calendar as my to-do list yet, but I do think it has been helpful trying that productivity style out. Well, speaking of these great tips, what do you think the key takeaways of today’s episode were, Rik?
Ricardo Signes: Well, first off, I think: don’t just trust what you get out of something like ChatGPT. It’s useful to think about it kind of like a Magic 8 Ball. You can ask it whatever question you want and get a plausible-sounding response, but it’s not a response coming from an expert who has an understanding of your question. It’s just text that’s been generated based on a bunch of other text. And so you need to really think about critically analyzing the response that you got back with your own human brain.
Ricardo Signes: And because of that, I think that we have a right to know when we’ve been handed this synthetic text. I don’t think that’s really being respected right now, but it’s something that we can insist on and that we can let people know we expect. We can build up the value of authenticity, right? Knowing that something someone purports to have produced, they really did produce; and when we are given text, or data, or a report, making clear that we want to know where it came from and how we know we can trust its origins.
Haley Hnatuk: Well, we really hope that you can take these actionable steps towards better digital citizenship. As we said earlier, this is the final episode of our regular season. Do you have any feedback for us or have an episode that you want to talk about? Or just a moment that stuck in your mind for the last couple of months? Go to digitalcitizenshow.com/survey. That’s digitalcitizenshow.com/survey and tell us what you think.
Ricardo Signes: Yeah, that’s it. I guess we’ll see you next season for more conversations about how to live your best digital life and until then, good luck.
Ricardo Signes: Thanks for listening to Digital Citizen. Digital Citizen is produced by Fastmail, the email provider of choice for savvy digital citizens everywhere. Our show is produced by Haley Hnatuk. Special thanks to the incredible team of people behind Fastmail. Digital Citizen is hosted by me, Ricardo Signes. You can subscribe to our show on your favorite podcast player, and for a free one-month trial of Fastmail, you can go to fastmail.com/podcast. And for more episodes, transcripts, and my takeaways, you can go to digitalcitizenshow.com.