The Internet is the world’s largest repository of user-generated data. Webpages, social media, forums, reviews, blog posts, and search data, when analyzed at scale, can reveal profound insights into consumer preferences and behavior. We at Quilt.AI specialize in interpreting the Internet to lead organizations toward better business decisions.
How do we make sure that the information we gather is anonymized and does not infringe on the privacy of the individual? Although all of the information we collect is publicly available, the open Web holds a large quantity of personally-identifiable information (PII) including names, phone numbers, and email addresses.
We are extra careful about personal privacy for both ethical and compliance reasons. We neither want nor need PII to extract insights for a given demographic; we only need aggregate data. We are not in the business of the individual; we are in the business of the cohort. So, how do we discard PII seamlessly while retaining important information?
To address the issue of data anonymization, we needed to build a PII filter. When an item of content is pulled from the Internet, the PII filter should discard any sensitive data that might exist in that content item.
An engineering team may choose to manually process each content item to perform a simple lookup, for example checking each word against a list of known names.
This solution is not very robust: a previously unseen name would not be recognized by such a filter. Furthermore, it is not context-aware: a sentence like “My name is Apple” should indicate that ‘Apple’ is a PII item to be discarded, but a simple lookup would not achieve this.
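To make the weakness concrete, here is a minimal sketch of such a lookup filter. The name list and the function name naive_filter are hypothetical and only serve as an illustration; this is not code we run in production.

```python
# Hypothetical list of known names; a real list would be far longer
KNOWN_NAMES = {"alice", "bob", "john"}


def naive_filter(text, names=KNOWN_NAMES):
    """Replace any word found in the name list with a <PERSON> tag."""
    masked = []
    for token in text.split():
        # Strip surrounding punctuation before the lookup
        core = token.strip(".,!?;:\"'")
        if core.lower() in names:
            masked.append(token.replace(core, "<PERSON>"))
        else:
            masked.append(token)
    return " ".join(masked)


print(naive_filter("My name is John."))   # the listed name is masked
print(naive_filter("My name is Apple."))  # the unlisted name slips through
```

The second call returns the sentence unchanged, because the lookup has no notion of the context "My name is".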
With recent advances in machine learning (ML), and specifically in natural language processing, it is possible to filter PII in a more contextual way. For our use case, we found the right tool in Presidio, an open-source Python library from Microsoft that offers pre-trained models for identifying and removing PII from text.
To use Presidio, we need two packages: presidio-analyzer and presidio-anonymizer. The former does the heavy processing and outputs its results in a format that the latter uses to replace PII within a sentence with the appropriate tag. Both packages can be installed with pip by running pip install presidio-analyzer presidio-anonymizer.
After both packages are installed, we need to download a model. Presidio can use either spaCy (default) or Stanza. After looking into the different models available from these repositories, we decided to stay with the default spaCy and its default English model, en_core_web_lg, which can be downloaded by running python -m spacy download en_core_web_lg.
After the model is downloaded, we need to run it and specify the entities we want to detect as well as the language our input text is in. All entities and languages supported by each model can be checked on their respective repo websites. For our test case, we’ll be using PERSON and EMAIL_ADDRESS entities and the English language.
Next, we instantiate the model and pass a sentence to it.
The analyzer outputs a list containing all identified PII entities, including their type and their location within the sentence.
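A minimal sketch of the analyzer step, assuming presidio-analyzer and the spaCy English model are installed. AnalyzerEngine and its analyze method come from Presidio; the helper names find_pii and describe and the sample sentence are illustrative and not part of the library.

```python
def find_pii(text, entities=("PERSON", "EMAIL_ADDRESS"), language="en"):
    """Run Presidio's analyzer and return its list of result objects."""
    # Imported lazily: creating the engine also loads the spaCy model
    from presidio_analyzer import AnalyzerEngine

    analyzer = AnalyzerEngine()
    return analyzer.analyze(text=text, entities=list(entities), language=language)


def describe(text, results):
    """Turn results into (entity_type, matched_text, start, end) tuples."""
    ordered = sorted(results, key=lambda r: r.start)
    return [(r.entity_type, text[r.start:r.end], r.start, r.end) for r in ordered]


# Example call (needs the packages and the spaCy model installed):
# sample = "My name is John and my email is john@example.com"
# for row in describe(sample, find_pii(sample)):
#     print(row)
```

Each result carries the entity type, the start and end character offsets, and a confidence score, which is everything the anonymizer needs in the next step.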
After this, we instantiate the Presidio anonymizer and feed it the original text together with the results of the analyzer.
The final result is the text with masked PII entities.
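A minimal sketch of the anonymizer step, assuming presidio-anonymizer is installed. AnonymizerEngine and its anonymize method come from Presidio; the wrapper name mask_pii is hypothetical.

```python
def mask_pii(text, analyzer_results):
    """Replace every detected entity with a tag such as <PERSON>."""
    # Imported lazily so the function can be defined without the package
    from presidio_anonymizer import AnonymizerEngine

    anonymizer = AnonymizerEngine()
    # With no operators given, Presidio replaces each entity with <ENTITY_TYPE>
    result = anonymizer.anonymize(text=text, analyzer_results=analyzer_results)
    return result.text
```

Here analyzer_results is the list returned by the analyzer for the same text, so the character offsets line up with the string being masked.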
Once we have our anonymized text, we can proceed with our analytics (sentiment, semiotics etc.) with no risk of infringing on individual privacy.
An open question remains around the “lossiness” of the PII filter. Since sentences such as “I love Luke but hate Anakin” would be transformed to “I love <PERSON> but hate <PERSON>”, do we actually dilute our insights when using the PII filter? While the intuitive answer would be yes, for large real-world Internet datasets we did not find a large difference in the quality of insights obtained. This is likely attributable to the nature of our datasets: we choose data about brands, places, and experiences, not data about people. Nevertheless, an intelligent masking system that distinguishes between PERSON1 and PERSON2 might be useful to explore.
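One way such numbered masking could look is sketched below. This is a hypothetical illustration and not something we have deployed: it works on plain (start, end, entity_type) spans, such as those reported by an analyzer, and assumes the spans do not overlap. The function name mask_with_indices is our own.

```python
def mask_with_indices(text, spans):
    """Mask spans with numbered tags so distinct values stay distinguishable.

    spans: iterable of (start, end, entity_type), assumed non-overlapping.
    """
    labels = {}    # (entity_type, lowercased value) -> numbered tag
    counters = {}  # entity_type -> number of distinct values seen so far
    pieces = []
    cursor = 0
    for start, end, entity_type in sorted(spans):
        key = (entity_type, text[start:end].lower())
        if key not in labels:
            counters[entity_type] = counters.get(entity_type, 0) + 1
            labels[key] = f"<{entity_type}{counters[entity_type]}>"
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(labels[key])         # numbered tag instead of the value
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


sentence = "I love Luke but hate Anakin"
spans = [(7, 11, "PERSON"), (21, 27, "PERSON")]
print(mask_with_indices(sentence, spans))  # I love <PERSON1> but hate <PERSON2>
```

The same value always receives the same tag within one text, so the contrast between the two people survives masking while the names themselves are discarded.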
At Quilt.AI, we use machine learning to extract cultural meaning from publicly-available, anonymized Internet data. Reach out to us at [email protected] for more information!
synthesizing vast data into actionable insights that reflect each market's unique cultural and economic backdrop
grasping the distinct consumer perspectives that these diverse regions offer
Curated digital profiles:
-Instagram, Twitter, and TikTok (US)
-Weibo and Douyin (China)
Pulled 400 million unique searches to estimate the growth of each segment
Used Quilt.AI’s Sphere language and image capabilities to categorise lifestyle areas into specific segments
These consumers are confident, bold, and comfortable with modern masculinity. They also often turn to social media to express their personal style and interests.
Actionable Insight: Collaborate with high-profile fashion influencers to create vibrant, trend-setting campaigns that resonate with this segment's desire for attention and admiration.
Highly image-driven, these individuals often seek validation through their appearance and are likely to engage heavily with both grooming and fashion products.
Actionable Insight: Leverage digital marketing strategies that feature before-and-after visuals and testimonials that showcase the transformative power of the products.
These men aim to be recognized as modern, open-minded, and sensitive – embodying the image of "the woke good guy" in today's society by actively participating in movements related to activism and gender equality.
Actionable Insight: Design marketing campaigns that highlight their participation in these movements, showcasing products that enable them to express and amplify their desired social identities.
They value beauty while still maintaining traditional masculine ideals of what it means to be good-looking. These men also tend to seek out methods of maintaining their youthful appearances.
Actionable Insight: Market products that boost physical appeal and suit active lifestyles, and focus on dynamic marketing that highlights masculine elegance.
Despite seeing gender in traditionally binary terms, these men aren’t afraid of behaving in more feminine manners. They own their uniqueness and tend to be deeply loyal to brands that affirm their identity.
Actionable Insight: Focusing on brand narratives that celebrate individuality and personal expression will better engage this segment. Brands can also offer personalized services to maintain their commitment.