Blog

Turning GPT Into Socrates

3/23/2023

Language Models

ChatGPT has the AI revolution in full swing. The question is, have we really solved the fundamental problem of accessing information and communicating it effectively?

It is exciting that we can now have rich dialogue with a computer. For two decades, we’ve asked Google what the most relevant website is to our query. Now, we ask the wise man GPT. And we can now instruct the wise man to write for us. As exciting as this is, we are making an erroneous assumption that GPT is wise.

The large language model (LLM) that GPT is based on does not think. It predicts text sequences that plausibly follow users’ text inputs, using statistical patterns in its training data (which does not consist solely of trusted authorities or great works of authorship). There is no reasoning going on here. Its outputs are not formulated by scholars, like an encyclopedia, or ranked by relevance, like Google search results. Thus, as a societal matter, we should be wary: rapidly increasing the pace of content creation, based on information that may neither be factually accurate nor serve to inform the reader, could crowd out the very information that made the internet useful.

There are solutions. The goal should be to create a system that can synthesize information, make it easier to find the trusted authorities, reason through it, offer up coherent perspectives, and like GPT, author works. In order to accomplish this, we have to tackle some underlying technical challenges.

First, anyone who has sat at the backend of a search engine observing user queries has realized that a ton of searches are vague, brief, and/or ambiguous. When I handed friends an app with a search bar for image generation, they searched for things like “soccer,” “woman dancing,” and “dog with flowers.” If you asked an artist to draw one of those descriptions, he would likely scrunch his forehead before peppering you with clarifying questions. There are ways to predict certain things about what you might mean based on your previous searches, previous searches by other users, and external data. However, like the artist, the algorithm cannot read your mind. In sum, there is a “garbage in” problem.

As an attorney, I spend a significant portion of my job asking follow-up questions to gather more information. Asking context-specific questions means gathering necessary information while estimating the probability that the person is willing or able to be more specific. Many journalists will attest that it’s typically easier to get someone to tell you a story than to describe something with specificity. And yet, software engineers often fear that asking clarifying questions puts too much friction in the interface and drives users away from the app. Nonetheless, without solving the “garbage in” problem, the outputs will continue to be inaccurate, random, and/or not very useful.

The second challenge is structuring software to contextualize and synthesize information. The Onion, a humor publication, is not the same as the Harvard Medical Journal. Double-blind studies are not the same as pop psychology that states that alcohol is healthy. These rules and hierarchies of authorities are teachable, and therefore they are programmable. Relating information from different domains, formulating complex hypotheses, and assembling experiments are things that deep learning is well equipped to do. Deep learning and, more broadly, machine learning are categories of statistical algorithms. LLMs are merely one tool in the toolkit, but at the step of contextualizing and synthesizing, they are the wrong tool.

Finally, expressing information in the most digestible and compelling way is another hard problem. Think about the difference between how your kindergarten teacher spoke to you, how a poet laureate writes, and how an attorney advises her client. On the one hand, there is colorful wording and the use of a variety of rhetorical tactics (narratives, analogies and metaphors among them) that allow you to connect with the language. And on the other, there is a spectrum from specificity to understandability. An explanation can generally be simple and reductive or complex and detailed, but not both. A superior solution would allow users to adjust the level of detail they desire. But the reduction of detail should never distort the meaning. This is particularly a concern for specialized domains like law and medicine, where the risk of error is very high.

Solving these challenges is the next frontier in language models. It is how we make the internet a better encyclopedia, instead of a graveyard of word sequences mashed together in a gazillion-parameter language model with a human face. Oh, Socrates, where art thou?

Meta/Facebook Can Still Facially Recognize You

11/3/2021

Meta, the company formerly known as Facebook, issued an update on their use of facial recognition technology. In the update, Meta announced that they will shut down the facial recognition system as part of a company-wide move to limit the use of facial recognition in their products. It is noteworthy, however, that although these features will no longer be accessible by end users, Meta likely can flip a switch at any moment and utilize the facial recognition they have developed up until now. Although the announcement indicates that Meta will delete more than a billion people's individual facial recognition templates, that does not mean that their algorithm will no longer be able to recognize these faces.

Artificial intelligence facial recognition algorithms train on data sets that map photographs to names or other personally identifying information. After the algorithm’s “weights” are trained (in this case, on a billion faces), deleting the image files will not reduce or in any way hinder the already trained algorithm. What Meta is saying in their announcement is that they likely don't need these files anymore, and to make the public think their custodial action in deleting the files is a benevolent one, they issued a public announcement.

GPT-3, Esq.? Evaluating AI Legal Summaries

2/17/2021

GPT-3’s emergence as a state-of-the-art natural language processing algorithm has drawn headlines suggesting that lawyers are soon to be replaced. As a lawyer who spent the last year studying machine learning, I decided to put GPT-3 to the test as a legal summarizer to evaluate that claim. In this experiment, I input three excerpts of legal texts into GPT-3 to summarize: LinkedIn’s Privacy Policy, an Independent Contractor Non-Disclosure Provision, and the hotly debated 47 U.S. Code § 230 (“Section 230”).

gpt-3_esq_-_testing_legal_summarization_-_rev_3-2-21.pdf

The Digital Universe as You Scroll It

6/23/2020

Do you ever open your phone and scroll through a news feed after pressing shuffle on Spotify, before moving on to flicking across the Netflix lineup for something to pass the time? Video stores, music stores, libraries and the newspaper have been reduced to a scrollbar. What appears to be a vast sea of information and entertainment available at the click of a button is instead spoon-fed to you by a statistical formula called a recommender system. Recommender systems are in theory designed to recommend content you may like based on content you have consumed. In practice, recommender systems hamper creative exploration and reinforce ideological entrenchment by myopically evaluating your digital activity and categorizing your interests.

After briefly describing how recommender systems function, I explore their pitfalls. My central argument is that we must vigilantly govern digital platforms’ recommender systems because they serve as the go-to sources of information and entertainment, setting the outer limits of creative exploration and truth seeking. Because access to information and art is critical to democracy in a modern society, the gateway platforms must be transparent in their methodology and allow users more choice. Recommender systems based on algorithmic transparency and user input can realize the unfulfilled potential of the internet by bringing the world’s content to our fingertips. I write not as a nostalgic Luddite longing for the return of video stores and Walter Cronkite, but as a concerned citizen who wants the digital age to encourage creative exploration and ideological exposure.

Inside the Recommender Black Box

Recommender systems record what content users consume and apply statistical formulas to determine what they are probably willing to engage with. It is important to recognize that recommendations are not always explicit under category listings titled “recommended for you.” They are the very ordering of the movie titles and news articles you see on your phone and computer every day.

There are two main recommender formulas which platforms commonly employ to recommend content: content-based filtering and collaborative filtering. Content-based filtering takes examples of what you like and dislike (comedy films, female lead vocals, news about COVID-19 cures) and matches you with similarly categorized content. Collaborative filtering matches your preferences with people of similar consumption patterns and recommends you additional content that those people have consumed.
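To make the two formulas concrete, here is a minimal sketch in Python. It is not how any particular platform does it: the catalog, tags, users, and the use of simple set overlap (Jaccard similarity) as the matching measure are all assumptions chosen to keep the example small.

```python
# Toy catalog and viewing histories. All titles, tags, and users are invented.
CATALOG = {
    "Laugh Track": {"comedy"},
    "Night Shift": {"comedy", "drama"},
    "Deep Field": {"documentary"},
    "Blue Note": {"jazz", "documentary"},
}
HISTORY = {
    "you": {"Laugh Track"},
    "twin": {"Laugh Track", "Deep Field"},
    "stranger": {"Blue Note"},
}


def jaccard(a, b):
    """Overlap between two sets, from 0 (disjoint) to 1 (identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0


def content_based(user):
    """Recommend unseen items whose tags overlap with what the user consumed."""
    seen = HISTORY[user]
    liked_tags = set().union(*(CATALOG[title] for title in seen))
    scores = {
        title: jaccard(tags, liked_tags)
        for title, tags in CATALOG.items()
        if title not in seen
    }
    ranked = sorted(scores.items(), key=lambda pair: -pair[1])
    return [title for title, score in ranked if score > 0]


def collaborative(user):
    """Recommend what the most similar other user consumed that this user has not."""
    seen = HISTORY[user]
    others = [u for u in HISTORY if u != user]
    nearest = max(others, key=lambda u: jaccard(seen, HISTORY[u]))
    return sorted(HISTORY[nearest] - seen)


print(content_based("you"))   # matches on the "comedy" tag
print(collaborative("you"))   # borrows from the most similar history
```

Note that the content-based function never looks at other users, and the collaborative function never looks at tags. That difference is what the problems discussed later turn on.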

Both collaborative filtering and content-based recommender systems use what you have consumed as a proxy for what you like. They interpret your interactions on the platform, and sometimes combine that data with your interactions on other parts of the internet, forming a profile of your preferences. For example, one method is to observe which videos you click on and finish as a proxy for which videos you prefer. A better method would involve rating videos upon completion. An even better system would require a rating and reason why you did or did not like it, but participation in those surveys is spotty. Thus, actions such as playing a song twice and slowly scrolling through an article serve as workable proxies for what you like.
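The proxy idea can be sketched as follows. The action names and their weights are hypothetical, picked only to show the pattern: an explicit rating is used when the user bothered to give one, and otherwise observed behavior is summed into a stand-in score.

```python
# Hypothetical weights: how strongly each observed action counts as evidence of liking.
SIGNAL_WEIGHTS = {"clicked": 1.0, "finished": 2.0, "replayed": 3.0, "skipped": -2.0}


def preference_estimate(events, explicit_rating=None):
    """Estimate how much a user likes one item.

    Returns a (source, value) pair so callers can tell a real rating
    apart from a guess inferred from behavior.
    """
    if explicit_rating is not None:
        return ("rating", float(explicit_rating))
    # Unknown actions contribute nothing rather than raising an error.
    return ("proxy", sum(SIGNAL_WEIGHTS.get(event, 0.0) for event in events))


print(preference_estimate(["clicked", "finished"]))
print(preference_estimate(["clicked"], explicit_rating=4))
```

Keeping the source alongside the value is a design choice of this sketch. It makes it visible when a profile is built mostly from guesses.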

Problem One: Privacy & Control

Although this method of preference gathering is imperfect (what if you were cooking and could not skip a song?), the more you consume, the better the platform can determine what you consistently will engage with. By adding data from outside the platform itself, the mosaic of your preferences gets more complex, forming a profile of your defined tastes and predictable ones. Generally, the more data on your preferences the algorithm has as an input, the more tailored a recommendation it can make. That is a slippery privacy slope, as an enormous data profile can be amassed over time by observing your habits in an effort to order your newsfeed or Spotify playlist.

Taken to the extreme, the best recommender would be inside your head, monitoring your thought patterns. We are already half-stepping towards this today. Smart speakers listen to conversations, camera-embedded smart TVs can record reactions to content, web scrapers devour social media feeds and online affiliations, and Google logs internet searches. If left unchecked, the level of psychological exploitation will inevitably grow to capture passing unexpressed thoughts and pinpoint what type of stimuli makes you happy, makes you sad, and makes you want to continue consuming. The more news, music, movies, and shopping platforms dominate our attention, the more computers and phones turn from tools into content shepherds that subtly steer us involuntarily.

What that means in practice is that we are ceding the right to filter information and media content to the companies that run the major content platforms. The complexity of these recommendation systems obscures the fact that there are people behind the wheel of the algorithms they employ, even if their systems are semi-autonomous. If psychological profiles developed for the purpose of recommender systems were to be shared with governments, employers, and schools, this same preference data could be used to discriminate and manipulate. Personalized profiles could become the basis of institutional access and job opportunities, not to mention the potential for psychological manipulation of the population at large.

By sharing all of your platform usage and content choices, you are accepting that a picture will be painted of you that may or may not represent you, and you will see content through a lens you will not know was prescribed for you.

Problem Two: Loss of Uniqueness

Collaborative filtering does not consider people to be unique. By design, it attempts to match users with other users who share similar consumption patterns. However, there are preferences, such as those based on unique life experiences, that are not shared with anyone on the planet. Even if you share 95% of your consumption in common with another profile, would you not prefer the freedom to scan the random content universe when you go searching for creative inspiration on Spotify, or the truth in your news feed, as opposed to being matched with your 95% consumption twin? Collaborative filtering makes the 95% similar person’s consumption habits the entire universe of news, music and shows available on the screen.

Consider that your friends are likely not anywhere near 95% similar to you in consumption patterns, and yet they can serve as an excellent source of recommendations. Why? Take news articles for example. You may disagree with a friend about a certain political candidate, but you may also be interested in the normative arguments they make by analogy to their life experience growing up in a small town in Mississippi, and you may be intrigued by the factual evidence they cite in favor of tax breaks to big businesses as a means to generate economic growth. That builds trust in their recommendation. Based on their analysis, you might be willing to read an article they recommend which discusses an instance of when big business tax breaks led to a boom in economic growth and the rise of certain boomtowns. You can evaluate the value of your friend’s recommendation considering what you know about them, as well as your knowledge of history and economics.

Collaborative filtering builds no such credibility of authority nor offers a synopsis of its reasoning. It removes the rationale from the recommendation. The collaborative filtering universe of options does not expand unless your digital twin searches for something outside of their previously viewed content. In other words, relying on collaborative filtering for recommendations ensures there are no hidden gems in your shuffle playlist or diversity of opinion in your newsfeed unless your soul sister from the ether searches for something outside of the recommender system.

Problem Three: Preference Entrenchment

Content-based recommender systems can broaden the collaborative filtering universe of recommendations by going beyond your consumption twins, but they myopically focus on the categorical characteristics of content you previously consumed. By doing so, content-based recommenders entrench you in your historical preferences, leaving no room for acquiring new tastes or ideas. Consider that if you have only ever listened to electronic dance music on Spotify, you will be hard pressed to get a daily recommendation of jazz. More likely, your playlist will be a mile-long list of electronic dance music. To get a recommendation of a new type of music, you have to search for it. Unlike record shopping of old, you would never see the album art that caught your eye on your way to the electronic music section, or hear a song playing in the background of an eclectic record store.

This preference entrenchment problem has deleterious consequences in the dissemination of information in the news. For example, a frequent New York Times reader who only clicks on anti-Donald Trump articles is likely only to be recommended more articles critical of Donald Trump. There are no easily accessible back page articles in a newsfeed. The recommender system does not allow for an evolution of political views because it is only looking at users’ historical preferences. That is a problem because one may not yet have formed an opinion on something one has not been exposed to. By tagging what you consume categorically, you are being involuntarily steered and molded into categories that may not represent your interests now, or in the future.

Problem Four: Hyper-Categorization of Preferences

Preferences are not always so clear cut. Data scientists love to define more granular categories of content to pinpoint preferences. The architect of these feature definitions is often a data scientist, occasionally aided by an expert in the field. The data scientist’s objective is to train a machine learning algorithm that automatically categorizes all the content on the platform. For example, Spotify might employ a musician to break down songs by genre, musical instruments, tempos, vocal ranges, lyrical content, and many other variables. The data scientists would then apply those labels to the entire music library by utilizing an algorithm designed to identify those characteristics, occasionally manually spot checking for accuracy.

But are those categories the reasons why you like a given love song? Or might the lyrics have reminded you of your ex-girlfriend or grandmother? The mechanical nature of breaking content down into feature categories often weighs superficial attributes over the ones we really care about. The flaw in the mechanical approach is subtle: by focusing on the qualities of the grains of sand (song features), recommenders fail to recognize they combine to form a beach.

In context, a great movie is not great merely because it is slightly different than another with a similar theme and cast. It is great because of the nuanced combination of emotional acting, a complex story, evolving characters, and a climax no one could have anticipated. Similarly, local artists are not merely interesting because they are not famous. Their music might be interesting because their life influences and musical freedom make their music rawer than what is frequently listened to on Spotify. Those are difficult characteristics to describe for a human, let alone a recommender algorithm. Recommenders struggle to categorize descriptions like “beautiful,” “insightful,” and “inspiring” because they are descriptions of complex emotions. Thus, their bias towards clear-cut categories and quantifiable metrics makes them poor judges of art.

It is no surprise, then, that recommenders are horrible at disseminating information. News articles can be broken down based on easy-to-identify categories such as publication source, quantity of links to other articles (citation count) or mentions of politicians’ names. Those categories might loosely relate to quality and subject, but they hardly provide a reason on which to base a recommendation. Moreover, these attributes do not signal the relevant attributes in news such as truthfulness, good writing, humor or demagoguery.

Because news is constantly changing, it is even harder to categorize than a static environment like music or movies, which can be manually tagged ahead of time. Facebook, Google and Twitter are in an ongoing battle to tag factual accuracy in news articles, particularly because it involves large teams of real people Googling what few credible information sources remain (instead of an algorithm trained to detect the presence of a trumpet in a song). The data scientists employed by these platforms seek to automate categorization whenever possible, but when truth lies in the balance, the stakes are too high. And yet, with an ever-growing amount of information to screen, the platforms are struggling to keep up, especially in complex subjects like COVID-19 cures, which require expertise to comprehend. At present, we are allowing recommender systems, aided by data scientists and contract-employee category taggers, to shape public perception. They do not appear to be helping expand access to information.

Problem Five: The User Interface Is the Content Universe

Shifting from retail stores and newspapers to screen scrolling may have kept us from leaving the sofa, but it did not make finding new content altogether a better experience. Strolling through music and video stores of old allowed for quickly browsing spines and close inspections of covers, and the sheer physical nature of the act made it feel more intentional. The advantages of the physical browsing experience are why DJs still love record stores and intellectuals will not let bookstores die. The face-out titles were the curated and popular ones. The more obscure titles were deeper in the stack. It may seem counterintuitive, but the inconvenience of digging made finding the hidden gems more rewarding.

Although the subscription all-you-can-consume business model conveniently allows for casual previews and skim reading, one can only finger swipe through movie titles and cover pictures for so long before clicking. Thus, the order of songs, movie titles and articles is highly influential. Clickbait media works because of the equal weighting of articles in a scroll feed. Tabloid news magazines used to stand out with their highlighter colors next to the candy bars in the checkout line at the supermarket. Today, they appear at least as often as the New York Times or the Economist in your feed. Recommender systems will reinforce clickbait tabloids over long-form journalism without batting an eye simply because they are more frequently clicked on.
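The reinforcement described above can be shown with a small simulation. This is a sketch under made-up assumptions, not a model of any real feed: the top-ranked item is assumed to receive 80% of the views each round, both items are given the same click rate, and the tabloid merely starts one click ahead.

```python
def rank_by_clicks(click_counts):
    """Order items by how often they were clicked, most clicked first."""
    return sorted(click_counts, key=lambda item: -click_counts[item])


def simulate(click_counts, click_rate, rounds, views_per_round=100, top_share=0.8):
    """Re-rank each round; the top slot gets most of the views and so earns more clicks."""
    counts = dict(click_counts)
    for _ in range(rounds):
        ranking = rank_by_clicks(counts)
        rest = ranking[1:]
        for position, item in enumerate(ranking):
            if position == 0:
                views = views_per_round * top_share
            else:
                # Remaining views are split evenly among lower-ranked items.
                views = views_per_round * (1 - top_share) / len(rest)
            counts[item] += views * click_rate[item]
    return counts


start = {"tabloid": 11, "longform": 10}
same_appeal = {"tabloid": 0.10, "longform": 0.10}
final = simulate(start, same_appeal, rounds=5)
print(rank_by_clicks(final), final)
```

Even though readers click both items at the same rate in this toy setup, the one-click head start widens every round because ranking position, not quality, decides who gets seen.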

Digital platforms fail to recognize that there is a degree of stewardship in curating news content. The content universe is, and always will be, curated. The status quo means trusting the data scientist architects and their recommender black boxes that influence what you see when you read the news with your morning cup of coffee, sit down at the end of a long day to enjoy a movie, or press shuffle to zone out while working. Moreover, it also means accepting that the elaborate apparatus of data collection of your preferences should continue to the extreme of understanding your psychological programming and turn the digital universe into a happy pill time drain, or worse.

It is important to note that the biases of recommender systems are not always intended consequences. They are in part limitations of black box algorithms too often left unsupervised or under-scrutinized. Engineering oversights happen. Applying recommender systems to the structuring of newsfeeds shifted the way a large portion of the population sees an issue like the viability of vaccines, but that was not likely the platform architects’ intention. Nonetheless, it is the result of neglect by platform engineers, managers, and executives. The more authority over the floodgates of information is reduced to mathematical formulas contained in closed-source black box programs, the more likely this neglect is to occur.

It is vital that we push back to gain control of the digital universe. It is a misconception to think that we have a world of information and content in our pockets if every scroll is based on a recommender system backstopped by a small team of people screening fringe content. The internet dominated by platforms is creating shepherds, not moral stewards. We must subject platforms, and their algorithms, to democratic governance and require that they be transparent in their processes to ensure that the potential for creative exploration and dissemination of truth is enhanced by their emergence as a pillar of modern life.

Democratic Platform Stewardship: An Alternative to Recommender Systems

I do not intend to suggest that the internet should merely look like the back shelf of the library. If instead of being steered by recommender systems, users are given the reins to select their preferences, the system becomes a useful tool instead of a dictator of preferences. The categories recommenders use could easily be made available to users. Instead of endlessly harvesting data on users’ habits to feed recommender systems, platforms could let users select their own preferences. In the interest of privacy, platforms could be required to not record those preference settings. The common big tech response is that the data they collect helps make services cheaper. Privacy and control are worth a few extra dollars a month.

Another solution is altogether more democratic. Users could score content for quality, truthfulness, and other relevant categories for the medium. For a more intimate rating system, users could opt into friends and interest groups to get community recommendations that may deviate from popular opinion or involve groups of credentialed critics. Smaller community discussions could provide useful background information and prevent majority groupthink and interested parties from dominating the narrative. These groups are prone to becoming self-isolating echo chambers, so it is important that they remain public. Encouraging discussions would remove us from the infinite scroll of provocative images and titles pushed by advertisers and reinforced by recommender systems.

Newsfeeds require particular attention because they are key sources of information for many. One of the big takeaways from the Cambridge Analytica and Russian election interference scandals was that the propaganda that gained the most interest was that which promoted ideological extremes and reinforced scapegoats. News feeds today are driven too much by an advertising model based on click-through rates and comment engagement, to the detriment of critical thinking and the dissemination of truth. Recommender systems left to their textbook formulas may be good for engagement but are bad for the spread of truth. That is not acceptable.

One potential solution in newsfeeds is to have community-based experts score each news post for truthfulness and have users score for ideology, with a tiered system of users to include community-elected expert moderators whose scores are given extra weight. The presence of moderators can provide a check on exploitation by any party who might seek to influence the platform by voting with fake accounts. All news feeds should seek to be balanced in ideology, but always attempt to be truthful. Balancing ideology aims to properly give readers both sides of a given issue. Bias may be impossible to eliminate, but stewardship in the curation of the modern newspaper is essential. Balanced journalism, however difficult to achieve, must at least be strived for in the digital age.
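One way to read "scores are given extra weight" is as a weighted average. The sketch below is only an illustration of that reading: the 0 to 10 scale, the moderator weight of 3, and the function name are all assumptions, not part of any existing platform.

```python
def weighted_truth_score(votes, moderator_weight=3.0):
    """Weighted average of truthfulness votes on a 0-10 scale.

    votes: list of (score, is_moderator) pairs.
    Moderator votes count `moderator_weight` times as much as ordinary votes.
    Returns None when there are no votes.
    """
    total = 0.0
    weight_sum = 0.0
    for score, is_moderator in votes:
        weight = moderator_weight if is_moderator else 1.0
        total += weight * score
        weight_sum += weight
    return total / weight_sum if weight_sum else None


# Two ordinary accounts rate a post 10; one elected moderator rates it 2.
votes = [(10, False), (10, False), (2, True)]
print(weighted_truth_score(votes))
```

With equal weights the two ordinary votes would carry the post to a high score. With the moderator weighted more heavily, a handful of fake accounts has a harder time overruling an elected expert, which is the check described above.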

Where Everybody Knows Your Name: Facial Recognition on Demand

9/13/2019

You’re at a cocktail party, invited by a friend you just met. A chap comes from across the room wearing the latest Google Glass and introduces himself after having looked up everything about you on the internet on his way over. Politicians, business leaders and celebrities go through this on a regular basis, but some of us enjoy relative anonymity, whether by avoiding social media (or making our profiles private), moving to a new place, or attending an event full of strangers. The key question is: do we have a right to be a stranger in public, and if so, how should we protect that right?

Few people remain absent from every photo database, be it a yearbook or Facebook. And yet, when photos are taken, most of us don't expect that they will end up as a part of a database used to identify us at a cocktail party, or walking down the street. That expectation is changing, as a much larger number of photos is being stored and shared on internet databases. Even if these photos are not public, they have the potential to end up in a database somewhere, forever tagged to your name. What could once be buried in boxes of stuff from yesteryear is now, or soon will be, stored and cataloged forever.

Moreover, augmentations will surely provide the means to accomplish cocktail party facial scans. The facial recognition tech, however flawed, exists today. Google Lens exists today. The wearable or implantable version is inevitable. If the database and the tech are both inevitabilities, what stops these strange encounters and the accompanying pervasive social effects? Privacy-minded people will point to collection as the first line of defense. How can we conceivably stop the scanning, posting, and mass tagging of our photos? We've already come to accept that as common practice on social media. Once someone somewhere has placed a photograph into a publicly searchable database, we can assume someone somewhere has scraped it and archived it. Therefore, if one clear photo of you that facial recognition technology has at some point associated with you has made it onto the internet, the cat is out of the bag.

One approach to prevent public facial scans is penalizing the app provider. The simple solution would be to ban facial recognition apps and databases by imposing heavy fines for violations. Facebook today may have the largest repository of photos associated with names outside of government-issued identification databases. Allowing an opt-out, as some tech companies have pitched, is insufficient. A data scientist equipped with enough uniquely identifying variables could likely reassociate your photo with your name. And, once that association is made, the government would have to stomp out every facial recognition tool and its database with the same vigilance with which it pursues child pornography. See the problem? An opt-out lets the cat out of the bag.

Alternatively, or in conjunction with a ban on facial recognition tools, we could ban the end user from using such tools. Enforcing this would mean policing people’s phones, glasses, and maybe one day their brains. That does not seem workable. Threatening app providers and database hosts with strict penalties is a more workable solution.

Some people may enjoy being a stranger, while others (social media stars) may want to be found in a facial recognition database. Whether we should define a right to public anonymity is up for debate. If the majority of people opted to be searchable, you might one day be avoided in public if you can’t be found because some may think you have some nefarious past to hide, or are as lame as those of us who don’t have Facebook pages in 2019. I’m content if opting out of facial recognition makes me not cool. I’m not cool if the world is like Cheers, “where everybody knows your name.”

Who Should the Tesla-Trolley Kill?

8/13/2019

Your self-driving car is cruising along at 35 miles per hour while you’re watching Netflix until it confronts the inevitable: three school children run out onto the road. Your car has a few options: (1) turn off the road to the right and run into a tree and likely kill you; (2) keep straight and run the three kids over, likely killing them; or (3) turn hard left and pile into the four business people heading to their favorite lunch spot. This moral dilemma is a derivative of the “trolley problem.” In the era of self-driving vehicles, we may be able to predetermine which lives are more valuable. In other words, the out-of-control trolley conductor’s split-second decision may be preprogrammed into automated vehicles.

Do you have a knee-jerk answer about who should be saved? Good. The harder question is determining why you think the way you do. Do you favor the children because they are younger than the businessmen, and thus you value younger lives? Do you favor the businessmen because there are more of them, or because they contribute more today to society? Do you believe the driver should sacrifice himself in order to die a hero? Would it change if he were the pope? Mercedes announced that they will program their cars to save the life of the driver in inevitable crash scenarios. That makes good sense; otherwise buyers might be deterred from buying those vehicles. However, eventually this may become a regulatory matter beyond the role of automakers.

I suggest that self-driving car systems first prefer to crash into whoever is least likely to die, and second prefer to save the greater number of people. Valuing people differently based on their actuarial value could be taken to an extreme. Doing so would have to rely on a database of personally identifying information or otherwise make superficial real-time estimations. Accuracy problems and algorithmic biases could run amok. Saving the driver at all costs, as Mercedes suggested, could lead to robotic cars plowing over school children. In contrast, the utilitarian calculation is simple and doable: save the most lives. Ultimately, in a fully autonomous future, one hopes that these accidents are few and far between.
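
To make the rule concrete, here is a minimal sketch of one possible reading of it: rank each maneuver by expected deaths first, and by the number of people put at risk second. The `Maneuver` class, the `choose_maneuver` function and every probability below are made up for illustration; no real vehicle is known to work this way.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Maneuver:
    name: str
    # Hypothetical estimated probability of death for each person put at risk.
    fatality_risks: List[float]

    @property
    def expected_deaths(self) -> float:
        return sum(self.fatality_risks)


def choose_maneuver(options: List[Maneuver]) -> Maneuver:
    # Lexicographic ordering: fewest expected deaths first,
    # then fewest people endangered as the tie-breaker.
    return min(options, key=lambda m: (m.expected_deaths, len(m.fatality_risks)))


# Invented numbers for the three options in the scenario above.
options = [
    Maneuver("swerve right into the tree", [0.9]),          # the driver
    Maneuver("continue straight", [0.8, 0.8, 0.8]),         # three children
    Maneuver("swerve hard left", [0.6, 0.6, 0.6, 0.6]),     # four pedestrians
]

best = choose_maneuver(options)
print(best.name)
```

Note what the sketch does not need: ages, occupations or identities. It only needs a rough estimate of how dangerous each collision is, which is the point of avoiding actuarial valuations.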

Know Thy Data, and Theirs, Too.

7/17/2019

Your data is valuable to others who want to sell you things. But is it valuable to you? Smartwatches’ ability to track sleep patterns, movement and heart rate variation gives us some greater understanding of our physical health. Netflix and YouTube recommendations sometimes help narrow down the “what to watch” list. A much more valuable use of your data could be to connect you with like-minded people.

A key point to understand is the difference between personal and comparative data. Personal data points can be interesting (“My max heart rate is 220bpm!”), but what’s far more interesting is comparative data. The first step would be to see how your maximum heart rate compares to the general population. That is not too helpful when the population includes both loafers and Olympic athletes (too wide of a range). Therefore, the next step is to put you in a subgroup that makes your data more relevant, such as age, activity level, etc. With the ability to control for these subgroups, you might be able to observe how your performance stacks up against fellow 30-something swimmers, whether you’re “healthy,” stressed at odd times, or capable of winning a marathon.
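
The difference between the two comparisons can be sketched in a few lines. This is only an illustration: the `Reading` record, the activity labels and all of the heart-rate values are invented, and a real comparison would need far more data than six people.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Reading:
    age: int
    activity: str        # hypothetical label such as "swimmer"
    max_heart_rate: int  # beats per minute


def percentile_rank(value: float, peers: List[float]) -> float:
    """Percentage of peer values that are at or below `value`."""
    if not peers:
        raise ValueError("no peers to compare against")
    return 100.0 * sum(1 for p in peers if p <= value) / len(peers)


def subgroup(readings: List[Reading], min_age: int, max_age: int, activity: str) -> List[int]:
    """Max heart rates of people in the given age band and activity."""
    return [
        r.max_heart_rate
        for r in readings
        if min_age <= r.age <= max_age and r.activity == activity
    ]


# Invented sample population.
population = [
    Reading(34, "swimmer", 182),
    Reading(31, "swimmer", 188),
    Reading(38, "swimmer", 176),
    Reading(36, "swimmer", 191),
    Reading(62, "sedentary", 150),
    Reading(24, "runner", 199),
]

my_max = 189
overall = percentile_rank(my_max, [r.max_heart_rate for r in population])
among_peers = percentile_rank(my_max, subgroup(population, 30, 39, "swimmer"))
print(f"overall: {overall:.1f}, among 30-something swimmers: {among_peers:.1f}")
```

The same number lands in a different place depending on who it is compared against, which is why the choice of subgroup matters more than the raw reading.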

Health data only skims the surface of what we could gain if we better understood our data and could compare it with others. How we think can be deduced from our Google searches, grammar structure, music-movie-book selections, time spent at given locales, photo taking habits, dietary patterns, etc.

Wouldn't you want to know if there was someone on the other side of the world that was just like you?

The ability to learn might be vastly accelerated by finding a model person to imitate who is more like you than your hero. You might also be able to find the ever-elusive soul mate, business partner, or best friend a lot faster than the brutish ways of modern networking.

Connecting through correlations might prove less enchanting than happenstance encounters. But if targeted advertising sells stuff more efficiently, finding the right people might be aided by knowing thy data, and theirs too.

Operating system data collection is the most grave threat to privacy

11/16/2017

An operating system is the all-seeing eye and software brain of a computer. From the moment a computer or smartphone is powered on (and potentially even when it is in “sleep” mode) the operating system acts as the switchboard for every mouse movement, keyboard button pressed, sound within the microphone’s range, sight within the camera’s view, screen output seen by the user, and data bit stored on the device. The ability to survey at the level of an operating system is equivalent to the reach of a camera with x-ray vision in the home, able to scan a diary within microseconds. If we do not act to protect operating system data, privacy in the modern age is meaningless.

Although there is no true comparison to the amount of data an operating system can collect, the only one that comes close is an internet service provider (ISP). The Federal Trade Commission has stated that “large platform providers [like internet service providers] that can comprehensively collect data across the internet present special concerns.” ISPs sit at the gateway of the internet, routing data from individual users to the rest of the web. A leading academic called ISPs the “single greatest point of control and surveillance.” Yet, ISPs pale in comparison to the offline reach of operating systems.

ISPs inherently involve transmission of data, but operating systems do not. Operating systems operate offline and do not transmit data unless an internet application is used. For example, journalists’ word processing activities and filmmakers’ editing processes take place offline. Simply because a computer or smartphone can be connected to the internet does not mean that all of its activity should be subject to surveillance. Driving a car onto a public road does not entitle anyone, much less a car manufacturer, to access everything one has ever done or said in the car, especially while it was in his or her garage. Operating systems create this virtual space on enclosed private property (our hobbies, inventions and passing thoughts) that should be kept free from prying eyes.

Absolutely Necessary to Properly Function or Secure Standard

Operating system data collection should be regulated by a rigorous standard: data should be collected only if it is absolutely necessary to ensure the functioning or security of the operating system. Microsoft has taken the position that it needs the data it collects in its Windows 10 operating system in order to diagnose the causes of computer crashes and to deliver security updates. However, among the data Microsoft collects is information that is not necessary to diagnose crashes or maintain security. For example, Microsoft collects text typed in an address bar or search box in a web browser, as well as incoming and outgoing calls in Skype, in addition to document reading activity. A Microsoft spokesperson recently publicly stated, “In the cases where we've not provided options, we feel that those things have to do with the health of the system.” It is simply not true that the aforementioned data are necessary to ensure the health of the system. In order to diagnose a crash, only data pertinent to the crash need be reported. That means a snapshot of a few seconds of activity closely related to the event that triggered the crash. The interest in maintaining the health of the system should not lead to overinclusive collection that invades users’ privacy.
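
A few-second snapshot is not hard to picture in code. The sketch below is purely hypothetical (the `CrashRecorder` class, its five-second window and the event names are my own inventions, not how Windows or any other operating system works): it keeps only recent diagnostic events in memory, discards everything older, and reports only the events from the component that crashed.

```python
from collections import deque
from typing import Deque, List, Tuple

Event = Tuple[float, str, str]  # (timestamp in seconds, component, diagnostic message)


class CrashRecorder:
    """Keeps only the last few seconds of diagnostic events in memory."""

    def __init__(self, window_seconds: float = 5.0) -> None:
        self.window_seconds = window_seconds
        self._events: Deque[Event] = deque()

    def record(self, timestamp: float, component: str, message: str) -> None:
        self._events.append((timestamp, component, message))
        self._trim(timestamp)

    def _trim(self, now: float) -> None:
        # Drop anything older than the window so no long-term history builds up.
        while self._events and now - self._events[0][0] > self.window_seconds:
            self._events.popleft()

    def crash_report(self, crash_time: float, component: str) -> List[Event]:
        # Report only recent events from the component that crashed.
        self._trim(crash_time)
        return [e for e in self._events if e[1] == component]


recorder = CrashRecorder(window_seconds=5.0)
recorder.record(0.0, "editor", "window opened")
recorder.record(7.0, "browser", "tab created")
recorder.record(9.5, "editor", "autosave started")
recorder.record(10.0, "editor", "unhandled error in autosave")

report = recorder.crash_report(crash_time=10.0, component="editor")
print(report)
```

The browser activity and the stale editor event never make it into the report, and the messages describe what the software was doing rather than what the user was typing.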

Require Consent for Every Transmission

Operating systems should require consent before transmitting data. Microsoft Windows historically only transmitted crash diagnostic data after an issue arose and with explicit user consent, but it has since changed its approach and now transmits usage data automatically without consent. Microsoft responded to recent public outcry over automatic transmission of user data in Windows 10 by including some privacy opt-out functions; however, Windows 10 still does not give users the option to disable all data transmission as it did in the past. Microsoft is not alone in eliminating consent requirements from its operating system. A software engineer recently discovered that his Android telephone was transmitting data to the operating system developer without his knowledge or consent. Simply put, there is no reason users should be deprived of an off switch for these transmissions.

Microsoft claims that it seeks to make the “experience better for everyone” by collecting everyone’s data. This is similar to a car company monitoring the driving habits and accidents of the vehicles it sells in order to measure the reliability of its cars. Car companies are required to comply with rigorous safety standards before distributing their cars instead of using the public as test dummies. The same rules should apply for operating systems. We should not be Microsoft’s test dummies at the expense of our privacy, unless we choose to do so.
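
Consent for every transmission could be as simple as a gate that nothing passes through without a yes. The sketch below is an assumption-laden illustration: `TransmissionGate`, `ask_user` and `send` are hypothetical names, and the "user" is a stand-in function that approves crash reports only.

```python
from typing import Callable, Dict, List

Payload = Dict[str, str]


class TransmissionGate:
    """Sends a payload only after the user has approved that specific payload."""

    def __init__(self, ask_user: Callable[[Payload], bool], send: Callable[[Payload], None]) -> None:
        self._ask_user = ask_user
        self._send = send

    def transmit(self, payload: Payload) -> bool:
        # The user sees exactly what would leave the device before it leaves.
        if not self._ask_user(payload):
            return False  # the off switch: nothing is sent
        self._send(payload)
        return True


sent: List[Payload] = []


def ask_user(payload: Payload) -> bool:
    # Stand-in for a real prompt: this user approves crash reports only.
    return payload.get("kind") == "crash_report"


gate = TransmissionGate(ask_user=ask_user, send=sent.append)

approved = gate.transmit({"kind": "crash_report", "detail": "autosave failed"})
refused = gate.transmit({"kind": "usage_stats", "detail": "search box text"})
print(approved, refused, len(sent))
```

The design choice that matters is that the send function is reachable only through the gate. Of course, as long as the operating system itself controls the gate, users still have to trust that it is wired the way it claims to be.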

Require Consent for Log File Recording

Operating systems should not store long-term user activity, unless the user consents to the storage. Microsoft claims that it “tries to avoid collecting personal information wherever possible (for example, if a crash dump is collected and a document was in memory at the time of a crash).” That means users are expected to rely on the goodwill of Microsoft, Apple, Google and other operating system developers to filter out the patent application or love letter being written at the time of a crash, after the operating system has already read its contents. Moreover, if an operating system stores users’ data, hackers may have the opportunity to intercept a large historical mosaic of personal information. Operating system developers can also entice or mislead users to share their log data years after the fact. At that time, users may not fully understand the breadth of the data that will be shared with the developer.

The hard truth of the digital age is that without knowledge of the operating system code (which is often inaccessible intellectual property), or the use of sophisticated programs to audit the operating system, it is difficult to determine precisely what data operating systems collect and transmit. Digital switches do not function like circuit breakers of old, where the flick of a switch could cut all power to the circuit. Instead, a data transmission setting-switch could appear off, but the computer could still be logging or transmitting data. Because operating systems control these switches, privacy law must be strictest at this level or the stored data could be later exploited.

Data Anonymization not a Viable Solution

Do not be misled by claims of anonymization. Many operating system developers claim to anonymize data. Yet it has been shown that “data can either be useful or perfectly anonymous but never both.” This is because powerful correlation analyses can stitch together seemingly innocuous data sets and tie them to an individual user. For example, a zip code, date of birth and gender together are enough to uniquely identify an estimated 87% of the U.S. population. The utility of demographic data would be hampered if any of those three factors were removed (for example, the ability to determine gender within a zip code, or date of birth of different genders). In sum, if data were truly anonymous, it would not be useful.
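
The mechanics of re-identification are easy to demonstrate on a toy data set. In the sketch below, the four "anonymized" records (names already stripped) and the `unique_fraction` helper are invented for illustration; the function simply measures what share of records are the only ones with their particular combination of fields.

```python
from collections import Counter
from typing import Dict, List


def unique_fraction(records: List[Dict[str, str]], keys: List[str]) -> float:
    """Share of records whose combination of `keys` appears exactly once."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)


# Invented records with names already removed.
records = [
    {"zip": "02139", "dob": "1985-03-14", "gender": "F"},
    {"zip": "02139", "dob": "1985-03-14", "gender": "M"},
    {"zip": "02139", "dob": "1990-07-02", "gender": "F"},
    {"zip": "02140", "dob": "1990-07-02", "gender": "F"},
]

all_three = unique_fraction(records, ["zip", "dob", "gender"])
zip_and_gender = unique_fraction(records, ["zip", "gender"])
print(all_three, zip_and_gender)
```

With all three fields, every record in this toy set points to exactly one person. Drop the date of birth and half the records blend into a crowd, but the data also says less, which is the trade-off in miniature.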

Down the Rabbit Hole