# Stanford Locator Geocoding Notebook

This Jupyter Notebook provides a workflow for batch geocoding addresses using Stanford's ArcGIS geocoding service, available at [locator.stanford.edu](https://locator.stanford.edu/). The service allows users to submit large numbers of addresses and receive geographic coordinates and related location information in return.

## What This Notebook Does

The notebook automates the process of submitting address data to the ArcGIS GeocodeAddresses REST API. It reads your input CSV file, processes the addresses in manageable batches, and writes the geocoded results to a new CSV file. The workflow is designed for efficiency and reliability, supporting large datasets and providing progress updates throughout the geocoding job.

## How to Use This Notebook

1. **Set Input Parameters:** Update the input parameters such as the path to your CSV file, output file location, and batch size in the designated cell.
2. **Run the Notebook:** Execute the cells in order. The notebook will read your address data, submit it to the geocoding service, and save the results.
3. **Monitor Progress:** The notebook prints progress updates and final statistics, so you can track the status of your geocoding job.

## Preparing Your Address Table

To ensure successful geocoding, your input CSV file must follow the required schema. Each column should match the field names expected by the ArcGIS geocoding service. Refer to the schema in the code block below for the correct column headers and structure.


### Dictionary template 
Your input CSV column headers should conform to the following:
```python
arcgis_address_format = {
    "Address": "",
    "Neighborhood": "",
    "City": "",
    "Subregion": "",  # Typically county or equivalent
    "Region": "",  # Typically state or equivalent
    "Postal": "",
    "CountryCode": ""
}
```
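One quick way to confirm that a CSV follows this template before submitting it is to compare its header row with the expected field names. The sketch below is only an illustration and is not part of the notebook's workflow: the helper name `check_headers` and the sample row are hypothetical, and it assumes that columns outside the template are allowed but worth flagging.

```python
import csv
import io

# Field names taken from the dictionary template above.
EXPECTED_FIELDS = ["Address", "Neighborhood", "City", "Subregion",
                   "Region", "Postal", "CountryCode"]

def check_headers(csv_file):
    """Return (missing, extra) header names for an open CSV file object.

    Hypothetical helper for illustration only.
    """
    reader = csv.DictReader(csv_file)
    headers = reader.fieldnames or []
    missing = [f for f in EXPECTED_FIELDS if f not in headers]
    extra = [h for h in headers if h not in EXPECTED_FIELDS]
    return missing, extra

# Made-up in-memory CSV; for a real file, pass
# open(csv_file_path, newline='', encoding='utf-8') instead.
sample = io.StringIO(
    "Address,City,Region,Postal,Notes\n"
    "123 Main St,Springfield,IL,62701,sample row\n"
)
missing, extra = check_headers(sample)
print("Missing template fields:", missing)
print("Columns not in the template:", extra)
```

Whether a missing column matters depends on your data; for example, a file with no `Neighborhood` values may still geocode well if the other fields are complete.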

# Imports

### What does this do?

The **imports section** brings in external Python modules (libraries) that provide extra features for your code. Here’s what each one does:

- **csv**  
  [csv documentation](https://docs.python.org/3/library/csv.html)  
  This is a built-in Python module for reading from and writing to CSV (Comma-Separated Values) files. It helps you handle spreadsheet-like data.

- **requests**  
  [requests GitHub repo](https://github.com/psf/requests) | [requests documentation](https://requests.readthedocs.io/)  
  This is a popular third-party library for making HTTP requests (like GET and POST) to web servers and APIs. In this notebook, it’s used to send address data to the ArcGIS geocoding service and get results back.

- **json**  
  [json documentation](https://docs.python.org/3/library/json.html)  
  This is a built-in Python module for working with JSON (JavaScript Object Notation) data. JSON is a common format for sending data between computers, especially over the web.

- **time**  
  [time documentation](https://docs.python.org/3/library/time.html)  
  This is a built-in Python module for working with time and measuring how long things take. Here, it’s used to track how long the geocoding process takes and estimate how much time is left.

---

**Tip:**  
- Built-in modules like `csv`, `json`, and `time` come with Python, so you don’t need to install anything extra to use them.
- Third-party modules like `requests` need to be installed first (usually with `pip install requests`).
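
If you are unsure whether a module is already available in your environment, a small check like the following can tell you before you run the install cell. This is a minimal sketch using the standard library's `importlib.util.find_spec`; the helper name `is_installed` is hypothetical.

```python
import importlib.util

def is_installed(module_name):
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None

# Report the status of each module this notebook imports.
for name in ("csv", "json", "time", "requests"):
    if is_installed(name):
        print(f"{name}: available")
    else:
        print(f"{name}: missing (try: pip install {name})")
```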

In [None]:
!pip install requests # If necessary, install requests library

In [None]:
import csv      # For reading and writing CSV files (input addresses and output results)
import requests # For making HTTP requests to the ArcGIS geocoding service API
import json     # For encoding/decoding data to/from JSON format (used in API requests/responses)
import time     # For tracking and reporting elapsed/estimated time during batch geocoding

## Input Parameters

  
Below are the input parameters you can adjust to control how the geocoding process works. Each parameter is explained with its purpose, default value, and possible options:

- **csv_file_path**  
  *Type:* `str`  
  *Default:* `'../Data/199Addresses.csv'`  
  *Description:* The relative path to your input CSV file containing the addresses you want to geocode. Make sure this file exists and follows the required schema. The path is relative to the Code folder where this notebook is located.

- **output_csv_path**  
  *Type:* `str`  
  *Default:* `'../Data/geocoded_199Addresses.csv'`  
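  *Description:* The relative path (including filename) where the geocoded results will be saved, in the Data folder. If the file already exists, new results are appended to it and a header row is written again at the start of each run, so rename or delete an old output file before starting a fresh job.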

- **arcgis_service_url**  
    *Type:* `str`  
    *Default:* `'https://locator.stanford.edu/arcgis/rest/services/geocode/USA/GeocodeServer/geocodeAddresses'`  
    *Description:* The URL for the ArcGIS geocoding service. You usually do not need to change this unless you are using a different geocoding server.

    **Available locator services on [locator.stanford.edu](https://locator.stanford.edu):**
    - **Asia Pacific:**  
        `https://locator.stanford.edu/arcgis/rest/services/geocode/AsiaPacific/GeocodeServer/geocodeAddresses`
    - **Europe:**  
        `https://locator.stanford.edu/arcgis/rest/services/geocode/Europe/GeocodeServer/geocodeAddresses`
    - **Latin America:**  
        `https://locator.stanford.edu/arcgis/rest/services/geocode/LatinAmerica/GeocodeServer/geocodeAddresses`
    - **Middle East & Africa:**  
        `https://locator.stanford.edu/arcgis/rest/services/geocode/MiddleEastAfrica/GeocodeServer/geocodeAddresses`
    - **North America:**  
        `https://locator.stanford.edu/arcgis/rest/services/geocode/NorthAmerica/GeocodeServer/geocodeAddresses`
    - **USA:**  
        `https://locator.stanford.edu/arcgis/rest/services/geocode/USA/GeocodeServer/geocodeAddresses`

    Choose the service that best matches your address data region.

- **jobSize**  
  *Type:* `int` or `'all'`  
  *Default:* `'all'`  
  *Description:* The total number of address records to process.  
    - Set to an integer (e.g., `100`) to process only the first N records (useful for testing).  
    - Set to `'all'` to process every record in your input file.

- **chunkSize**  
  *Type:* `int`  
  *Default:* `100`  
  *Description:* The number of address records sent to the geocoding service in each batch.  
    - The recommended value is between 20 and 1000.  
    - Too high a value may cause errors; too low may slow down processing.
    - For the sample 199Addresses.csv file, 100 is a good chunk size.

- **outFields**  
  *Type:* `str`  
  *Default:* `'*'`  
  *Description:* Controls which fields are included in the output.  
    - Use `'*'` to include all available output fields.  
    - Use `'none'` for minimal output (just latitude and longitude).

- **printJob**  
  *Type:* `str`  
  *Default:* `'yes'`  
  *Description:* Whether to print a status line each time a batch is submitted to the geocoding service.  
    - Set to `'yes'` to print requests.  
    - Set to `'no'` to suppress this output.

**Tip:**  
If you are new to Python, you can change these values directly in the code cell where they are defined. Make sure to keep the correct data type (e.g., use quotes for text, no quotes for numbers).
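
As a rough safeguard, the parameter values can be checked before a long job starts. The helper below is a hypothetical sketch, not part of the notebook; it is also stricter than the geocoding function itself, which converts `jobSize` and `chunkSize` with `int()` and silently caps `chunkSize` at 1000.

```python
def validate_parameters(jobSize, chunkSize, outFields, printJob):
    """Return a list of problems found in the parameter values (empty if none).

    Hypothetical helper for illustration only.
    """
    problems = []
    if jobSize != 'all' and not (isinstance(jobSize, int) and jobSize > 0):
        problems.append("jobSize must be a positive integer or 'all'")
    if not (isinstance(chunkSize, int) and 1 <= chunkSize <= 1000):
        problems.append("chunkSize must be an integer from 1 to 1000")
    if not isinstance(outFields, str):
        problems.append("outFields must be a string such as '*'")
    if not (isinstance(printJob, str) and printJob.lower() in ('yes', 'no')):
        problems.append("printJob must be 'yes' or 'no'")
    return problems

# Example: the notebook's default values pass, a few bad values do not.
print(validate_parameters('all', 100, '*', 'yes'))
print(validate_parameters(0, 5000, '*', 'maybe'))
```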

**Available Sample Datasets in the Data Folder:**
- `199Addresses.csv` - Small test dataset (199 addresses)
- `clowns.csv` - Clown business locations
- `SantaClara_TattooParlors.csv` - Tattoo parlor locations in Santa Clara County
- `oneMillionAddresses.csv` - Large dataset for testing (1 million addresses)

Choose the dataset that best fits your needs or use your own CSV file following the required schema.

In [None]:
# Input and output file paths (relative to the Code folder)
csv_file_path = '../Data/199Addresses.csv'
output_csv_path = '../Data/geocoded_199Addresses.csv'

# ArcGIS geocoding service URL
arcgis_service_url = 'https://locator.stanford.edu/arcgis/rest/services/geocode/USA/GeocodeServer/geocodeAddresses'

# Processing parameters
jobSize = 'all'  # Process all records, or set to a number like 100 for testing
chunkSize = 100  # Number of records per batch
outFields = '*'  # Include all output fields
printJob = 'yes'  # Print a status line as each batch is submitted

## Prepare and submit POST Requests from CSV

The `geocode_addresses_batch_rest` function reads your address CSV file, sends the addresses in batches to the ArcGIS geocoding service, and saves the results to a new CSV file. 

**How it works:**
- Reads your input CSV file and splits the addresses into batches (size set by `chunkSize`).
- For each batch, sends the addresses to the ArcGIS geocoding service and gets back location results.
- Writes the geocoded results to your output CSV file, adding new results as they come in.
- Shows progress updates, including how many records are done and how much time is left.
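
To make the batching step concrete, the sketch below isolates how rows are split into chunks and wrapped in the `records`/`attributes` structure that the function sends as the `addresses` parameter. It mirrors the logic of `geocode_addresses_batch_rest`, but the helper names `build_batches` and `build_payload` and the sample rows are made up for illustration.

```python
import json

def build_batches(addresses, chunk_size):
    """Split a list of address rows into consecutive chunks of chunk_size."""
    return [addresses[i:i + chunk_size] for i in range(0, len(addresses), chunk_size)]

def build_payload(batch, batch_index, chunk_size):
    """Wrap one batch in the {"records": [{"attributes": {...}}]} structure."""
    return {
        "records": [
            # OBJECTID continues across batches so every record has a unique ID.
            {"attributes": {"OBJECTID": idx + batch_index * chunk_size, **row}}
            for idx, row in enumerate(batch)
        ]
    }

# Five made-up rows, split into chunks of two.
rows = [{"Address": f"{n} Main St", "City": "Springfield", "Region": "IL"}
        for n in range(1, 6)]
batches = build_batches(rows, chunk_size=2)
print("Batch sizes:", [len(b) for b in batches])

# Payload for the second batch (batch_index=1), as it would be JSON-encoded.
payload = build_payload(batches[1], batch_index=1, chunk_size=2)
print(json.dumps(payload, indent=2))
```

With five rows and a chunk size of two, the last batch holds a single record, and the second batch carries `OBJECTID` values 2 and 3.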

In [None]:
def geocode_addresses_batch_rest(
    csv_file_path,
    arcgis_service_url,
    output_csv_path,
    jobSize,
    chunkSize,
    outFields,
    printJob
):
    """
    Processes a CSV file to geocode addresses using the ArcGIS Server GeocodeAddresses REST batch endpoint.
    Submits up to 1000 records per request.
    Appends each chunk to the output CSV as it completes, so results already written survive an interruption.
    Reports final stats at the end.
    """

    # Open the input CSV file for reading. The file contains address records to be geocoded.
    with open(csv_file_path, mode='r', newline='', encoding='utf-8') as file:
        reader = csv.DictReader(file)  # Yields one dictionary per row, keyed by column header
        addresses = list(reader)
        # If jobSize is not 'all', only process up to jobSize records
        if jobSize != 'all':
            addresses = addresses[:int(jobSize)]

    total_records = len(addresses)  # Total number of records to process
    start_time = time.time()  # Record the start time for progress reporting

    # ArcGIS REST API allows up to 1000 records per batch request.
    # chunkSize is set to the minimum of user-specified chunkSize and 1000.
    chunkSize = min(int(chunkSize), 1000)
    # Split the addresses into batches of chunkSize
    batches = [addresses[i:i + chunkSize] for i in range(0, total_records, chunkSize)]

    csv_exists = False  # Tracks whether this run has already written the header row
    fieldnames = set()  # Collects all field names encountered in the geocoded results
    total_processed = 0  # Counter for total processed records

    # Iterate over each batch of addresses
    for batch_index, batch in enumerate(batches):
        # Prepare the records in the format required by the ArcGIS REST API
        records = {
            "records": [
                {
                    "attributes": {
                        "OBJECTID": idx + batch_index * chunkSize,  # Unique ID for each record
                        **{key: record.get(key, "") for key in record}  # Include all fields from the input
                    }
                } for idx, record in enumerate(batch)
            ]
        }
        # Set up the parameters for the POST request to the geocoding service
        params = {
            'f': 'json',  # Response format
            'outFields': outFields,  # Fields to return in the response
            'addresses': json.dumps(records)  # The batch of addresses as a JSON string
        }

        # Optionally print progress for each batch
        if printJob.lower() == 'yes':
            print(f"Submitting batch {batch_index+1}/{len(batches)} with {len(batch)} records...")

        response = None  # Reset per batch so the error handler never sees an unbound or stale response
        try:
            # Send the POST request to the ArcGIS geocoding service
            response = requests.post(arcgis_service_url, data=params)
            response.raise_for_status()  # Raise an error if the request failed
            resp_json = response.json()  # Parse the response as JSON
        except Exception as e:
            # Print error message and skip this batch if the request fails
            print(f"Request failed: {e}")
            print(f"Response content: {getattr(response, 'text', '')}")
            continue

        # Process each geocoded location in the response
        batch_records = []
        for location in resp_json.get('locations', []):
            # Each location should have an 'attributes' dictionary with geocoded data
            if isinstance(location.get('attributes'), dict):
                batch_records.append(location['attributes'])
            else:
                print(f"Warning: Unexpected data format in response: {location}")

        # Update the set of fieldnames with any new fields from this batch
        for record in batch_records:
            fieldnames.update(record.keys())

        # Write the geocoded batch results to the output CSV file
        if batch_records:
            write_header = not csv_exists  # Write header only for the first batch
            with open(output_csv_path, mode='a', newline='', encoding='utf-8') as file:
                writer = csv.DictWriter(file, fieldnames=sorted(fieldnames))
                if write_header:
                    writer.writeheader()
                for record in batch_records:
                    writer.writerow(record)
            csv_exists = True  # Mark that the CSV now exists

        # Progress reporting for the user
        processed = (batch_index + 1) * chunkSize  # Number of records submitted so far
        if processed > total_records:
            processed = total_records
        total_processed += len(batch_records)  # Count of geocoded records returned and written
        remaining = total_records - processed  # How many records are still to be submitted
        elapsed_time = time.time() - start_time  # Time elapsed so far
        if processed > 0:
            # Estimate total and remaining time based on progress so far
            estimated_total_time = elapsed_time / processed * total_records
            estimated_remaining_time = estimated_total_time - elapsed_time
            print(f"Processed {processed}/{total_records} records. Remaining: {remaining}. Estimated time to finish: {time.strftime('%H:%M:%S', time.gmtime(estimated_remaining_time))}")

    # After all batches are processed, print final statistics
    total_time = time.time() - start_time
    records_per_hour = (total_processed / total_time) * 3600 if total_time > 0 else 0
    print(f"Geocoded data appended to {output_csv_path}")
    print(f"Final stats: {total_processed} records processed in {total_time:.2f} seconds ({records_per_hour:.2f} records/hour)")

## Running the Geocoding Function

The next code block actually starts the geocoding process by calling the `geocode_addresses_batch_rest` function.  
It uses the parameter values you set earlier in the notebook (like `csv_file_path`, `output_csv_path`, etc.).

**How it works:**
- When you run the code block, it will read your input CSV file, send the addresses to the ArcGIS geocoding service in batches, and write the results to your output CSV file.
- The function uses the current values of the parameter variables. If you change any of these variables in this cell (for example, set a different `chunkSize`), those new values will be used instead of the earlier ones.

**Tip:**  
You can edit the parameter values directly in this code block to override the settings from the parameter section above. This is useful if you want to quickly test different options without changing the main parameter cell.

### Feedback
When you run the geocoding function, you’ll see feedback printed directly in the notebook. This feedback includes:

- **Batch Submission Updates:**  
    For each batch of addresses sent to the geocoding service, the notebook prints which batch is being submitted and how many records it contains.

- **Progress Reports:**  
    After each batch, you’ll see how many records have been processed, how many remain, and an estimated time to finish based on the current speed.

- **Error Messages:**  
    If a batch fails to process, an error message will be printed with details to help you troubleshoot.

- **Final Statistics:**  
    When all batches are complete, the notebook prints a summary showing the total number of records processed, the total time taken, and the average processing speed (records per hour).

These messages help you monitor the geocoding job and estimate how long it will take to finish. Setting the `printJob` parameter to `'no'` suppresses the batch submission updates; progress reports, error messages, and final statistics are still printed.
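
The estimated time to finish is a simple linear projection: it assumes the remaining records will take as long per record as the ones already done. The sketch below reproduces that arithmetic on its own; the helper name `estimate_remaining_seconds` and the sample numbers are made up for illustration.

```python
import time

def estimate_remaining_seconds(elapsed_seconds, processed, total_records):
    """Project remaining time from the average time per record so far.

    Returns None when nothing has been processed yet.
    """
    if processed <= 0:
        return None
    estimated_total = elapsed_seconds / processed * total_records
    return estimated_total - elapsed_seconds

# Made-up example: 100 of 199 records done after 30 seconds.
remaining = estimate_remaining_seconds(elapsed_seconds=30.0, processed=100, total_records=199)
print("Estimated time to finish:", time.strftime('%H:%M:%S', time.gmtime(remaining)))
```

With these numbers the projection is roughly 29.7 seconds. Because it relies on the average speed so far, the estimate tends to be rough for the first batch or two and steadier as more batches complete.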

In [None]:
# Example usage:
geocode_addresses_batch_rest(
    csv_file_path=csv_file_path,
    arcgis_service_url=arcgis_service_url,
    output_csv_path=output_csv_path,
    jobSize=jobSize,
    chunkSize=chunkSize, 
    outFields=outFields,
    printJob=printJob
)