TF-IDF code and synthetic data for the paper "Who do you think you are? An experimental comparison of online and offline identities using NLP techniques"

Researchers have been debating whether our online identities are different from our offline identities since technology enabled us to operate in the digital space. This research advances the online-offline debate by utilizing Natural Language Processing (NLP) techniques to compare identity-related statements across offline and online contexts. Participants (N = 120) completed twenty “I am…” statements in two conditions: one reflecting offline identity and one reflecting online identity. Distinct linguistic patterns in textual data were identified using TF-IDF, odds ratios, and Fisher's exact test. The results revealed that online identity statements emphasized themes of (anonymous) connectivity, whereas offline identity statements focused on physical and emotional states and personal relationships. These findings suggest that, despite blurring boundaries between online and offline worlds, individuals continue to present their identities in contextually distinct ways. This research provides insights into the conceptualization of identity and highlights the need to design digital spaces that better balance connectivity with opportunities for deeper emotional and relational engagement.

Keywords:
TF-IDF

Cite this dataset as:
Johansen, J., Piwek, L., 2026. TF-IDF code and synthetic data for the paper "Who do you think you are? An experimental comparison of online and offline identities using NLP techniques". Bath: University of Bath Research Data Archive. Available from: https://doi.org/10.15125/BATH-01505.

Data

Synthetic-Data.xlsx
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet (14kB)
Creative Commons: Attribution 4.0

The attached spreadsheet contains synthetic data to aid understanding and reproducibility. The column names match those of the clean data set, but all cell contents are examples created by the researchers based on the raw data.

Survey.pdf
application/pdf (199kB)
Creative Commons: Attribution 4.0

The attached PDF is the full survey provided to participants, including an introduction, briefing information, consent form, demographic questions, TST prompts, and debriefing information.

Code

Comparing-Offline … R-Script.zip
application/zip (2kB)
Creative Commons: Attribution 4.0

The attached R code runs the TF-IDF analysis of participants' identity-related statements.
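
For readers unfamiliar with TF-IDF, the following is a minimal sketch of the general technique, not the authors' script (the archived code is in R; Python is used here only for illustration). It assumes each condition (offline, online) is treated as one document, that term frequency is a term's share of all tokens in that document, and that inverse document frequency is the natural log of the number of documents divided by the number of documents containing the term. The example statements, the `tokenize` helper, and the `tf_idf` function are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical example statements; NOT taken from the archived data.
statements = {
    "offline": ["I am tired", "I am a sister", "I am happy"],
    "online": ["I am anonymous", "I am connected", "I am happy"],
}


def tokenize(text):
    """Lowercase a statement and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tf_idf(docs):
    """Return {doc_name: {term: score}}, treating each key as one document."""
    # Term counts per document (all statements of a condition pooled).
    counts = {
        name: Counter(tok for s in texts for tok in tokenize(s))
        for name, texts in docs.items()
    }
    n_docs = len(counts)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for c in counts.values() for term in c)
    scores = {}
    for name, c in counts.items():
        total = sum(c.values())
        scores[name] = {
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in c.items()
        }
    return scores


scores = tf_idf(statements)
for name, terms in scores.items():
    # Show the three highest-scoring terms per condition.
    top = sorted(terms.items(), key=lambda kv: kv[1], reverse=True)[:3]
    print(name, [(term, round(score, 3)) for term, score in top])
```

Under these assumptions, words that appear in both conditions (such as "am") score zero, so only terms distinctive to one condition receive a positive weight.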

liwc_analysis.R
text/plain (6kB)
Creative Commons: Attribution 4.0

The attached R code runs the LIWC analysis of participants' identity-related statements. Note that the LIWC 2015 source dictionary must be purchased separately.

Creators

Lukasz Piwek
University of Bath

Contributors

Adam Joinson
Supervisor
University of Bath

Catherine Hamilton-Giachritsis
Supervisor
University of Bath

University of Bath
Rights Holder

Documentation

Data collection method:

The data used in the study is confidential and is therefore not provided; synthetic data is provided instead. The code was written by the authors.

Data processing and preparation activities:

Data processing steps are included in the code. The survey given to participants is attached so that other researchers can collect comparable data of their own. It uses the same set of twenty "I am ..." prompts as the well-established Twenty Statements Test.

Technical details and requirements:

RStudio.

Additional information:

Kuhn, M. H., & McPartland, T. S. (1954). An empirical investigation of self-attitudes. American Sociological Review, 19(1), 68-76.

Documentation Files

Survey_Comparing … Identity.docx
application/vnd.openxmlformats-officedocument.wordprocessingml.document (15kB)
Creative Commons: Attribution 4.0

Funders

Engineering and Physical Sciences Research Council
https://doi.org/10.13039/501100000266

PhD Studentship
EP/S022465/1

Publication details

Publication date: 18 June 2026
by: University of Bath

Version: 1

DOI: https://doi.org/10.15125/BATH-01505

URL for this record: https://researchdata.bath.ac.uk/1505

Related papers and books

Johansen, J., Hamilton-Giachritsis, C., Piwek, L., and Joinson, A., 2026. Who do you think you are? An experimental comparison of online and offline identities using NLP techniques. International Journal of Human-Computer Studies, 215, 103874. Available from: https://doi.org/10.1016/j.ijhcs.2026.103874.

Contact information

Please contact the Research Data Service in the first instance for all matters concerning this item.

Contact person: Jessica Johansen

Departments:

Faculties and Schools
School of Management

Research Centres & Institutes
EPSRC Centre for Doctoral Training in Cyber Security