How to Load the SBA Paycheck Protection Program Loan Data into MySQL or Any RDBMS

 

Loading a new text file into any RDBMS generally follows these steps:

  1. Download the file. If you need to download it on a regular basis, this step should be automated.
  2. Review the file data, paying close attention to the file header if one exists.
  3. Create the database tables to store the data. I prefer not to alter the source files much while loading, so that you can audit the data against the source at a future time. It is also wise to store the name of the file that was loaded and the date/time it was loaded. Think of this initial table as a staging table whose sole purpose is to store the data in the RDBMS, as close to the source file as is reasonable.
  4. Load the data. If you hit an error because a column is not large enough to store the source data, adjust the table structure, empty the table, and start again.
  5. Validate the data: do a record count to make sure all records were loaded, and double-check that no data types got mangled.
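The loader in this article is written in PHP, but the steps above can be sketched in a few lines of Python. This is a minimal sketch: `insert_row` is a hypothetical stand-in for whatever database call your environment provides.

```python
import csv
import os
from datetime import datetime

def load_file(path, insert_row):
    """Load one CSV into a staging table via the supplied insert function.

    Skips the header record (step 2) and stores the source file name and
    load time with every row (step 3). `insert_row` is a hypothetical
    callback standing in for your database insert.
    """
    loaded = 0
    with open(path, newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        next(reader)  # the first record is a header; we do not load it
        for row in reader:
            # keep the row close to the source, plus the audit columns
            insert_row(row + [os.path.basename(path), datetime.now().isoformat()])
            loaded += 1
    return loaded  # step 5: compare this to the file's line count minus one
```

The returned count feeds directly into the validation step at the end of this article.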

 

Download the PPP Loan file that contains the data for all States and territories:

Get the file here: https://home.treasury.gov/policy-issues/cares-act/assistance-for-small-businesses/sba-paycheck-protection-program-loan-level-data

The data comes as one zipped file with a file structure like this:

 

We only care about finding the .csv files.

So we have a decision to make: do we write code to crawl the directory structure of the unzipped file and hand each file name and path to the loading process, or do we extract the file names first and push them to the loader? Let's see an example of how to get all the file names out of the zipped file.  **** This assumes a Linux or Unix environment.  **** For a Windows solution, look at the code example below that crawls the directory looking for the .csv files.

Linux solution

Unzip the file into its own directory, navigate to that directory, and issue the following command:

find . -name "*.csv" | grep -v "__MACOSX"

You should get a list like this:

./Maryland/foia_up_to_150k_MD.csv
./New York/foia_up_to_150k_NY.csv
./South Dakota/foia_up_to_150k_SD.csv
./Minnesota/foia_up_to_150k_MN.csv
./North Carolina/foia_up_to_150k_NC.csv

The above will give you a list of all the .csv files, but be careful: the file "/150k plus/foia_150k_plus.csv" does not have the same layout as the state and territory files. See the ERD below for the differences:
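If you would rather locate the files in code than with `find`, here is a rough cross-platform sketch in Python that also sets the differently structured 150k-plus file aside; the file name `foia_150k_plus.csv` is taken from the listing above.

```python
from pathlib import Path

def find_csv_files(root):
    """Walk the unzipped directory tree and split the .csv files into the
    two layouts described above: the per-state files and the differently
    structured 150k-plus file. Works on Linux and Windows alike."""
    state_files, large_loan_files = [], []
    for path in Path(root).rglob("*.csv"):
        if "__MACOSX" in path.parts:  # skip the macOS zip metadata directory
            continue
        if path.name == "foia_150k_plus.csv":
            large_loan_files.append(path)  # different layout; load separately
        else:
            state_files.append(path)
    return state_files, large_loan_files
```

Each list can then be fed to the loader for the matching table.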

Here is a Windows PHP example of crawling through the directories looking for any .csv file and passing that file to the loader:


function getListOfCsvFiles($dir){
    $result = array();
    foreach (scandir($dir) as $value)
    {
        // skip the current/parent entries and the macOS zip metadata directory;
        // add more names here if other directories need excluding
        if (in_array($value, array(".", "..", "__MACOSX"))) {
            continue;
        }
        $path = $dir . DIRECTORY_SEPARATOR . $value;
        if (is_dir($path)) {
            // recurse into subdirectories
            $result = array_merge($result, getListOfCsvFiles($path));
        } elseif (strtolower(pathinfo($path, PATHINFO_EXTENSION)) === 'csv') {
            $result[] = $path;
        }
    }
    return $result;
}


The above function returns an array of full paths to the files, which are passed to the loader; the loader kicks off loading file by file and dies if it hits a catastrophic error.

 

The Meta Data for the SBA Paycheck Protection Program Loan Data

LoanAmount,City,State,Zip,NAICSCode,BusinessType,RaceEthnicity,Gender,Veteran,NonProfit,JobsRetained,DateApproved,Lender,CD

When loading external data you should track the source file and date/time loaded too!  

 

Create the table structure for the PPP data ( DDL ):

CREATE TABLE `pppdatalargeloans` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`loanrange` varchar(40) DEFAULT NULL,
`BusinessName` varchar(200) DEFAULT NULL,
`Address` varchar(200) DEFAULT NULL,
`city` varchar(200) DEFAULT NULL,
`state` varchar(20) DEFAULT NULL,
`zipcode` varchar(20) DEFAULT NULL,
`NAICSCode` varchar(30) DEFAULT NULL,
`businesstype` varchar(200) DEFAULT NULL,
`raceethnicity` varchar(200) DEFAULT NULL,
`gender` varchar(20) DEFAULT NULL,
`veteran` varchar(20) DEFAULT NULL,
`nonprofit` varchar(50) DEFAULT NULL,
`jobsretained` varchar(30) DEFAULT NULL,
`dateapproved` date DEFAULT NULL,
`lender` varchar(200) DEFAULT NULL,
`cd` varchar(200) DEFAULT NULL,
`dateloaded` datetime DEFAULT NULL,
`filename` varchar(100) DEFAULT NULL,
PRIMARY KEY (`id`));
CREATE TABLE `pppdata` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`loanamount` decimal(15,2) DEFAULT NULL,
`city` varchar(200) DEFAULT NULL,
`state` varchar(20) DEFAULT NULL,
`zipcode` varchar(20) DEFAULT NULL,
`naicscode` varchar(30) DEFAULT NULL,
`businesstype` varchar(200) DEFAULT NULL,
`raceethnicity` varchar(200) DEFAULT NULL,
`gender` varchar(20) DEFAULT NULL,
`veteran` varchar(50) DEFAULT NULL,
`nonprofit` varchar(50) DEFAULT NULL,
`jobsretained` varchar(50) DEFAULT NULL,
`dateapproved` date DEFAULT NULL,
`lender` varchar(200) DEFAULT NULL,
`cd` varchar(200) DEFAULT NULL,
`dateloaded` datetime DEFAULT NULL,
`filename` varchar(100) DEFAULT NULL,
PRIMARY KEY (`id`));

 

Loading the PPP Loan Data:

 

Loading data with an ETL tool like Talend is great for situations like this, but getting Talend to traverse all the directories and locate the right files is tricky. It might be possible, but once I wrote the code to locate the files, I just wrote the loader in PHP, since I have a debugger that makes development a lot easier than in the old days!

function load_ppfile_data($data, $conn, $filename){
    // keep only the base file name, dropping the directory path
    $newfilename = basename($filename);
    // escape every field so embedded quotes cannot break the SQL ($conn is assumed to be a mysqli connection)
    $d = array_map(function ($v) use ($conn) { return mysqli_real_escape_string($conn, $v); }, $data);
    $sql = 'INSERT INTO pppdata VALUES (null,'
        .$d[0].',"'.$d[1].'","'.$d[2].'","'.$d[3].'","'.$d[4].'","'.$d[5].'","'.$d[6].'","'.$d[7].'","'.$d[8].'","'.$d[9].'","'.$d[10]
        .'",STR_TO_DATE("'.$d[11].'","%m/%d/%Y"),"'.$d[12].'","'.$d[13].'",sysdate(),"'.$newfilename.'")';
    // run the insert and die on a catastrophic error so a bad file does not half-load silently
    if (!mysqli_query($conn, $sql)) {
        die(mysqli_error($conn));
    }
}

This is where ETL tools come in handy, since you can visually map the data file fields to the database fields. The loader could have looked cleaner if I had built a JSON object from each file record and passed that object to the loader; mapping JSON keys to database fields would read better. This is a one-off data loader, though, so just getting the data in is the goal.
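To illustrate that cleaner mapping idea, here is a rough Python sketch that zips the file header into a record and builds a parameterized INSERT instead of concatenating strings. The header names come from the metadata listed above; passing the result to `cursor.execute(sql, params)` with a DB-API MySQL driver is assumed, not shown.

```python
# the header record from the state/territory files, as listed above
HEADER = ("LoanAmount,City,State,Zip,NAICSCode,BusinessType,RaceEthnicity,"
          "Gender,Veteran,NonProfit,JobsRetained,DateApproved,Lender,CD").split(",")

def row_to_insert(row, filename):
    """Map one CSV record to a parameterized INSERT for the pppdata table.

    Placeholders avoid the quote-escaping headaches of string-built SQL;
    STR_TO_DATE converts the m/d/Y dates the files use. Returns the SQL
    text and its parameter tuple, ready for cursor.execute(sql, params).
    """
    record = dict(zip(HEADER, row))  # the "JSON object" mapping idea
    sql = ("INSERT INTO pppdata VALUES (null, %s, %s, %s, %s, %s, %s, %s, %s,"
           " %s, %s, %s, STR_TO_DATE(%s, '%%m/%%d/%%Y'), %s, %s, sysdate(), %s)")
    params = tuple(record[h] for h in HEADER[:11]) + \
             (record["DateApproved"], record["Lender"], record["CD"], filename)
    return sql, params
```

Because the fields are named, reordering a column only means touching the mapping, not counting commas.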

 

Validate the data:

Depending on how your loader is developed, if it hits an error it may just blow up, spew the error, and stop completely.

Let's match the number of data records in a file to what we have loaded in the database:

On Linux / Unix

 wc -l foia_up_to_150k_AL.csv   # wc (word count) with the -l flag counts the lines in the file

As of the writing of this article, the number is 57932.

Now let's see how many records we have in the database associated with this file:

select count(*) from pppdata where filename ='foia_up_to_150k_AL.csv';

57931

***Keep in mind that in each of the data files, the first record is a file header: LoanAmount,City,State,Zip,NAICSCode,BusinessType,RaceEthnicity,Gender,Veteran,NonProfit,JobsRetained,DateApproved,Lender,CD. We DO NOT load this header, so the total line count minus one is the true record count.

So (file count 57932) - (the header) = 57931, which is the number of records in our database!
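The file-side half of that reconciliation can be scripted. A minimal Python sketch, equivalent to `wc -l` minus the header, whose result should equal the SELECT COUNT(*) for that file name:

```python
def data_record_count(path):
    """Count the data records in a PPP file: total lines minus the header.

    Compare the result to
    SELECT COUNT(*) FROM pppdata WHERE filename = '<file>';
    in the database; the two numbers should match exactly.
    """
    with open(path, encoding="utf-8") as fh:
        total = sum(1 for _ in fh)  # same count wc -l reports
    return total - 1  # the header record is never loaded
```

Running this over every file in the batch makes the validation step a loop instead of a manual spot check.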


Disclaimer:

Not all data environments are the same; exceptions exist in the ecosphere that would make this solution a poor choice.