THIS README IS FOR THE DATA/RAW_DATA/COVARIATES DIRECTORY.

THIS DIRECTORY CONTAINS THE RAW SOCIO-ECONOMIC DATA FOR EACH CITY, WHICH ARE AVAILABLE FROM THE US CENSUS BUREAU THROUGH THE AMERICAN COMMUNITY SURVEY (ACS). THESE ARE AVAILABLE AT THE CENSUS TRACT LEVEL FOR COUNTIES WITHIN STATES, SO WE NEED TO USE THE PROCESSED SHAPEFILE DATA TO EXTRACT THE RELEVANT CENSUS TRACTS FOR THE CITIES. THERE ARE SUB-DIRECTORIES FOR EACH CITY, WHICH CONTAIN THE RAW SOCIO-ECONOMIC DATA AND THE PROCESSED SOCIO-ECONOMIC DATA (THE OUTPUT COVARIATES). WE HAVE TWO R SCRIPTS FOR THE SOCIO-ECONOMIC DATA. THE FIRST, MAIN SCRIPT CREATES THE SEPARATE SOCIO-ECONOMIC DATA BUT TREATS ALL THE MISSING AVERAGE INCOME DATA THE SAME. THE SECOND SCRIPT INSTEAD CONSIDERS WHICH OF THE MISSING VALUES MAY BE DUE TO A TOTAL HOUSEHOLD ESTIMATE OF ZERO IN A CENSUS TRACT, AND THESE ARE ASSIGNED ZERO. MORE DETAIL ABOUT THESE DIFFERENT METHODS CAN BE FOUND IN CHAPTER 4 OF MY THESIS AS WELL AS IN THE README FILES FOR THE SUB-DIRECTORIES.

THE RAW AND MANIPULATED DATA FILES ARE NOT CONTAINED WITHIN THIS ARCHIVED FOLDER. THE RAW DATA CAN BE ACCESSED THROUGH THE US CENSUS BUREAU ACS DATA, AS DISCUSSED IN DataAccessInformation.pdf AND DESCRIBED IN APPENDIX F OF MY THESIS, AND THE MANIPULATED DATA CAN BE CREATED THROUGH THE R FILES IN THIS DIRECTORY. ALTHOUGH THE FILES ARE NOT ARCHIVED, WE DISCUSS THE NAMING CONVENTIONS BELOW AS THEY ARE USED WITHIN THE R SCRIPTS.

NOTE: THE DATA FROM THE ACS MAY BE UPDATED. THIS APPLIES ESPECIALLY TO THE LOS ANGELES 2015 DATA, WHICH WAS ORIGINALLY ACCESSED THROUGH THE AMERICAN FACTFINDER (WHICH WAS LATER RETIRED). NEWLY ACCESSED DATA MAY THEREFORE HAVE DIFFERENT COLUMN SET-UPS/NAMES, SIMILAR TO THOSE IN THE MORE RECENTLY ACCESSED NEW YORK AND PORTLAND DATA.

- CovDataGen_final.R: this takes the raw socio-economic data files and extracts the necessary census tract estimates of the variable and the margin of error, as well as the geolocation in terms of the label for the census tract and the coordinates with respect to the projected census tract. This R script considers the average income and treats all of the missing data the same, interpolating each missing value as the average of the neighbouring census tracts. For the interpolation, the neighbouring census tracts may lie anywhere in the county and need not be in the city.

- CovDataGen_Inc_final.R: this is similar to the above, but we only consider the average income in this R script and, in particular, we set the missing average income estimates to 0 if the total household estimate for that census tract is also 0.
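
The two treatments of missing average income described above can be sketched on toy data. This is only an illustration under stated assumptions, not the code in CovDataGen_final.R or CovDataGen_Inc_final.R: the data frame, the neighbour list `nbrs` and the helper `impute_neighbour_mean` are hypothetical names, the values are made up, and it is assumed that the second script interpolates any values still missing after the zero-household rule is applied.

```r
# Hypothetical toy data: five census tracts, two with missing average income.
# Tract B has a total household estimate of 0.
tracts <- data.frame(
  tract      = c("A", "B", "C", "D", "E"),
  households = c(120, 0, 95, 80, 60),
  income     = c(50000, NA, 70000, NA, 65000)
)

# nbrs[[i]] holds the row indices of the tracts adjacent to tract i
# (assumed to be known already, e.g. from the tract geometry).
nbrs <- list(c(2, 3), c(1, 3), c(1, 2, 4), c(3, 5), c(4))

# Replace each missing value by the mean of its observed neighbours.
# Neighbours that are themselves missing are ignored.
impute_neighbour_mean <- function(x, nbrs) {
  out <- x
  for (i in which(is.na(x))) {
    out[i] <- mean(x[nbrs[[i]]], na.rm = TRUE)
  }
  out
}

# Treatment 1: every missing income is interpolated from its neighbours.
income_imp <- impute_neighbour_mean(tracts$income, nbrs)

# Treatment 2: missing income becomes 0 where the household estimate is 0;
# anything still missing is then interpolated (an assumption of this sketch).
income_zero <- tracts$income
income_zero[is.na(income_zero) & tracts$households == 0] <- 0
income_imp0 <- impute_neighbour_mean(income_zero, nbrs)
```

In this toy example tract B receives the neighbour average under the first treatment and 0 under the second, while tract D is interpolated in both.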

- Raw Data in city sub-directories: 
For Los Angeles the 2015 data was accessed earlier using the American FactFinder through the US Census Bureau. This was retired in March 2020, so the 2015 data for New York and Portland as well as the 2014 Los Angeles data were accessed through the new US Census Bureau API, and the naming convention is slightly different. To get the relevant census tract data for a particular city we have to select the census tracts in the counties that contain the city of interest.
	-- Los Angeles 2015 in folders:
		--- ACS_15_5YR_B01003: the total population over census tracts in 2015
		--- ACS_15_5YR_S1902: the average income over the census tracts in 2015

	-- Los Angeles 2014 in folders:
		--- ACSDT5Y2014.B01003_*: the total population over census tracts in 2014
		--- ACSST5Y2014.S1902_*: the average income over the census tracts in 2014

	-- New York 2015 in folders:
		--- ACSDT5Y2015.B01003_*: the total population over census tracts in 2015
		--- ACSST5Y2015.S1902_*: the average income over census tracts in 2015

	-- Portland 2015 in folders:
		--- ACSDT5Y2015.B01003_*: the total population over census tracts in 2015
		--- ACSST5Y2015.S1902_*: the average income over census tracts in 2015

(For each city we also had data with codes S0101 for age and sex, S2201 for food stamps/SNAP, and B25003 for tenure of properties (owned/rented). The code for the extraction of these variables is still in the R scripts but has been commented out, so I will not concentrate on them here. If they are required, these codes can make finding the relevant tables on the US Census Bureau API much easier.)
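
Selecting the census tracts in the counties that contain a city can be sketched as below. This is a minimal illustration, not the code in the R scripts: the column names `GEO_ID` and `estimate` and the numeric values are hypothetical, and the actual scripts additionally use the processed shapefile data to keep only the tracts in the city. The sketch relies on the standard 11-digit census tract code, made up of the state (2 digits), county (3 digits) and tract (6 digits) codes, with 06037 being Los Angeles County.

```r
# Hypothetical rows in the style of an ACS tract-level table (values made up).
acs <- data.frame(
  GEO_ID   = c("1400000US06037101110",
               "1400000US06037101122",
               "1400000US06059001101"),
  estimate = c(4219, 3234, 5102),
  stringsAsFactors = FALSE
)

# The 11-digit tract code follows the "US" marker in the identifier.
acs$geoid <- sub("^.*US", "", acs$GEO_ID)

# The first five digits are the state and county codes.
acs$county_fips <- substr(acs$geoid, 1, 5)

# Keep only the tracts in the counties of interest.
counties   <- c("06037")  # Los Angeles County
acs_county <- acs[acs$county_fips %in% counties, ]
```

For a city that spans several counties, `counties` would hold one five-digit code per county.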


- Outputs:
	-- CovDataGen_final.R:
		--- *_CTPop_15_proj.rds: total population for census tracts in city *.
		--- *_CTInc_15_imp_proj.rds: imputed average income where all missing data is treated the same and interpolated using the average of neighbouring census tracts.
	-- CovDataGen_Inc_final.R:
		--- *_CTInc_15_imp0_proj.rds: imputed average income, where missing values in census tracts with an estimated total household count of 0 are set to 0.

These outputs are also copied over from each city sub-directory into the DATA/PROCESSED_DATA/COVARIATES directory, without separation into individual city-based sub-directories. These are then used in the creation of the count data (census tract and gridded) within the DATA/PROCESSED_DATA/CRIME directory and its sub-directories as well as the DATA/MODELS/GLMS directory for the Ripley's K estimation.
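
As a small illustration of the naming convention, the output file names for one city can be built and read back as follows. The city prefix "LA" is a hypothetical stand-in for the `*` in the file names above, and the toy data frame and temporary directory are used only so that the sketch runs on its own.

```r
# Hypothetical city prefix; it stands in for the "*" in the output names.
city <- "LA"

# Output file names following the convention listed above.
out_files <- c(
  pop      = sprintf("%s_CTPop_15_proj.rds", city),
  inc_imp  = sprintf("%s_CTInc_15_imp_proj.rds", city),
  inc_imp0 = sprintf("%s_CTInc_15_imp0_proj.rds", city)
)

# Round trip through a temporary directory with a toy object.
tmp <- tempdir()
toy <- data.frame(tract = c("A", "B"), pop = c(10, 20))
saveRDS(toy, file.path(tmp, out_files[["pop"]]))
back <- readRDS(file.path(tmp, out_files[["pop"]]))
```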

Note: for the Minimum Contrast we also have 2014 Los Angeles data to use instead of the 2015 data. However, this covariate data is not created in this sub-directory; it is created in the DATA/MODELS directory, where the 2014 data is required, along with the census tract count and grid count data.
