TF-IDF code and synthetic data for the paper "Who do you think you are? An experimental comparison of online and offline identities using NLP techniques"
Researchers have been debating whether our online identities are different from our offline identities since technology enabled us to operate in the digital space. This research advances the online-offline debate by utilizing Natural Language Processing (NLP) techniques to compare identity-related statements across offline and online contexts. Participants (N = 120) completed twenty “I am…” statements in two conditions: one reflecting offline identity and one reflecting online identity. Distinct linguistic patterns in the textual data were identified using TF-IDF, odds ratios, and Fisher's exact test. The results revealed that online identity statements emphasized themes of (anonymous) connectivity, whereas offline identity statements focused on physical and emotional states and personal relationships. These findings suggest that, despite blurring boundaries between online and offline worlds, individuals continue to present their identities in contextually distinct ways. This research provides insights into the conceptualization of identity and highlights the need to design digital spaces that better balance connectivity with opportunities for deeper emotional and relational engagement.
Cite this dataset as:
Johansen, J. and Piwek, L., 2026. TF-IDF code and synthetic data for the paper "Who do you think you are? An experimental comparison of online and offline identities using NLP techniques". Bath: University of Bath Research Data Archive. Available from: https://doi.org/10.15125/BATH-01505.
Data
Synthetic-Data.xlsx
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet (14kB)
Creative Commons: Attribution 4.0
The attached spreadsheet contains synthetic data to aid understanding and reproducibility. The column names match those of the cleaned data set, but all of the contents are examples created by the researchers based on the raw data.
Survey.pdf
application/pdf (199kB)
Creative Commons: Attribution 4.0
The attached PDF is the full survey provided to participants, including an introduction, briefing information, consent form, demographic questions, Twenty Statements Test (TST) prompts, and debriefing information.
Code
Comparing-Offline … R-Script.zip
application/zip (2kB)
Creative Commons: Attribution 4.0
The attached R code runs a TF-IDF analysis on participants' identity-related statements.
liwc_analysis.R
text/plain (6kB)
Creative Commons: Attribution 4.0
The attached R code runs a LIWC analysis on participants' identity-related statements. Note that the LIWC 2015 source dictionary must be purchased separately.
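For readers unfamiliar with the TF-IDF step named above, the following is a minimal sketch of the general technique, not the deposited R script. It is written in Python using only the standard library, treats each condition (offline, online) as one document, and uses the common definition of term frequency multiplied by the natural log of inverse document frequency. The example statements, the `tokenize` helper, and the `tf_idf` function are all hypothetical and invented for illustration; they are not taken from the synthetic or raw data.

```python
import math
import re
from collections import Counter

# Hypothetical "I am ..." statements, invented for illustration only.
statements = {
    "offline": ["I am tired", "I am a sister", "I am happy"],
    "online": ["I am anonymous", "I am connected", "I am happy"],
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tf_idf(docs):
    """Return {document: {term: tf-idf}} using tf * ln(N / df).

    Each key of `docs` is treated as one document (here, one condition).
    """
    counts = {
        name: Counter(tok for s in texts for tok in tokenize(s))
        for name, texts in docs.items()
    }
    n_docs = len(counts)
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter(term for c in counts.values() for term in c)
    scores = {}
    for name, c in counts.items():
        total = sum(c.values())
        scores[name] = {
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in c.items()
        }
    return scores


scores = tf_idf(statements)
for condition, term_scores in scores.items():
    # Highest-scoring terms first; ties broken alphabetically.
    top = sorted(term_scores.items(), key=lambda kv: (-kv[1], kv[0]))[:3]
    print(condition, top)
```

Under this definition, a term that appears in both conditions (such as "happy" in the invented example) scores zero, so only condition-specific terms stand out. The deposited R script may tokenize, filter, or weight terms differently.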
Contributors
Adam Joinson
Supervisor
University of Bath
Catherine Hamilton-Giachritsis
Supervisor
University of Bath
University of Bath
Rights Holder
Documentation
Data collection method:
The data used in the study are confidential and are therefore not provided. Synthetic data are provided instead. The code was written by the authors.
Data processing and preparation activities:
Data processing steps are included in the code. The survey provided to participants has been attached so that researchers can collect their own versions of the data. It uses the same set of twenty "I am ..." prompts used in the well-established Twenty Statements Test.
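The processing steps themselves live in the deposited R code. As a rough illustration of the odds ratio and Fisher test comparison named in the description, the sketch below (Python, standard library only) computes a sample odds ratio and a two-sided Fisher's exact test p-value for a 2x2 table of term counts by condition. The function names and the counts are hypothetical, and the authors' script may define the table or the odds ratio differently.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Sample odds ratio (a*d)/(b*c) for the 2x2 table [[a, b], [c, d]].

    Returns infinity when b*c is zero; no continuity correction is applied.
    """
    if b * c == 0:
        return float("inf")
    return (a * d) / (b * c)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more probable than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance guards against floating-point ties.
    total = sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-7)
    )
    return min(1.0, total)


# Hypothetical counts, invented for illustration only:
# rows = condition (online, offline); columns = (target term, all other terms).
online_term, online_other = 12, 88
offline_term, offline_other = 2, 98

print("odds ratio:", odds_ratio(online_term, online_other, offline_term, offline_other))
print("p-value:", fisher_exact_two_sided(online_term, online_other, offline_term, offline_other))
```

An odds ratio above 1 in this layout would indicate that the term is relatively more frequent in the online condition, and the p-value indicates how surprising the observed split is under fixed margins.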
Technical details and requirements:
RStudio.
Additional information:
Kuhn, M. H., and McPartland, T. S., 1954. An empirical investigation of self-attitudes. American Sociological Review, 19(1), 68-76. (Source of the Twenty Statements Test.)
Documentation Files
Survey_Comparing … Identity.docx
application/vnd.openxmlformats-officedocument.wordprocessingml.document (15kB)
Creative Commons: Attribution 4.0
Funders
Engineering and Physical Sciences Research Council
https://doi.org/10.13039/501100000266
PhD Studentship
EP/S022465/1
Publication details
Publication date: 18 June 2026
by: University of Bath
Version: 1
DOI: https://doi.org/10.15125/BATH-01505
URL for this record: https://researchdata.bath.ac.uk/1505
Related papers and books
Johansen, J., Hamilton-Giachritsis, C., Piwek, L., and Joinson, A., 2026. Who do you think you are? An experimental comparison of online and offline identities using NLP techniques. International Journal of Human-Computer Studies, 215, 103874. Available from: https://doi.org/10.1016/j.ijhcs.2026.103874.
Contact information
Please contact the Research Data Service in the first instance for all matters concerning this item.
Contact person: Jessica Johansen
Faculties and Schools
School of Management
Research Centres & Institutes
EPSRC Centre for Doctoral Training in Cyber Security