D.4 CSV, Excel, and Tabular Data Files

The KSL also has simple utilities for working with comma separated value (CSV) files and Excel files. The underlying CSV file processing library used by the KSL is Commons CSV. The KSL's support for CSV files is not meant to replace the functionality of that library. Instead, the purpose is to provide a simple facade so that users can do some basic processing without worrying about the complexities of a full-featured CSV library. Figure D.3 illustrates the functions and properties of the CSVUtil object.

Figure D.3: CSVUtil Class

Assume that the CSV data is organized with the first row as a header and each subsequent row holding the data for the columns, as follows:

"x", "y"
1.1, 2.0
4.3, 6.4

The KSL class CSVUtil has the following functions:

  • readRows() : List<Array<String>> Reads all rows into a list, where each row is held as an array of strings.
  • readRows() : Array<DoubleArray> Reads all rows into an array, where each row is held as an array of doubles.
  • readToColumns() : Array<DoubleArray> Reads all of the rows and transposes them into columns.
  • writeArrayToCSVFile(array: Array<DoubleArray>) Writes a 2-D array to a file in CSV format, with an optional header and an option to add quotes to strings.
  • csvIterator(): Iterator<Array<String>> Iterates through a CSV file row by row, as shown in the sketch after this list.
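
For example, a minimal sketch of using csvIterator() to process a file one row at a time might look like the following. It assumes that csvIterator() takes the path to the CSV file as its argument; check the KSL documentation for the exact signature.

    // a sketch: iterate the rows of a CSV file one at a time
    // assumes csvIterator() accepts the path to the file
    val path = KSL.csvDir.resolve("data.csv")
    val rowIterator = CSVUtil.csvIterator(path)
    while (rowIterator.hasNext()) {
        val row: Array<String> = rowIterator.next()
        println(row.joinToString())
    }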

The following code illustrates the use of the CSVUtil class. An instance of a NormalRV random variable is used to make a 5 row, 4 column matrix of normal (\(\mu = 0\), \(\sigma = 1\)) random variates.

    // make a 5 row, 4 column matrix of N(0, 1) random variates
    val n = NormalRV()
    val matrix = n.sampleAsColumns(sampleSize = 5, nCols = 4)
    for (i in matrix.indices) {
        println(matrix[i].contentToString())
    }
    // define a header and write the matrix to a CSV file in the KSL csv directory
    val h = listOf("col1", "col2", "col3", "col4")
    val p = KSL.csvDir.resolve("data.csv")
    CSVUtil.writeArrayToCSVFile(matrix, header = h.toMutableList(), pathToFile = p)
    println()
    // read the rows back as strings (skipping the header) and parse them to doubles
    val dataAsList: List<Array<String>> = CSVUtil.readRowsToListOfStringArrays(p, skipLines = 1)
    val m = KSLArrays.parseTo2DArray(dataAsList)
    for (i in matrix.indices) {
        println(m[i].joinToString(prefix = "[", postfix = "]"))
    }

The contents of the matrix are printed to the console. A header is made for the CSV file and then the matrix is written to the file using the CSVUtil function for writing an array to a CSV file. Then, the CSVUtil object is used to read the data back into a list of string arrays, which are parsed to double values. You can also use the Commons CSV functionality directly to read and write values via its API. The approach illustrated here is only meant for simple file processing.
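
If you prefer to work with Commons CSV directly, the following sketch reads the file that was written above using the library's CSVFormat API (parse() returns a CSVParser). This is not a KSL API; it simply shows the underlying library in use, and the format options you need may differ.

    import org.apache.commons.csv.CSVFormat
    import java.nio.file.Files

    // a sketch: read the previously written data.csv directly with Commons CSV
    val p = KSL.csvDir.resolve("data.csv")
    Files.newBufferedReader(p).use { reader ->
        // treat the first record as the header and do not return it as data
        val format = CSVFormat.DEFAULT.builder()
            .setHeader()
            .setSkipHeaderRecord(true)
            .build()
        for (record in format.parse(reader)) {
            // each record is iterable over its string fields
            val values = record.map { it.trim().toDouble() }
            println(values)
        }
    }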

Excel file operations are available within the KSL through the Apache POI library. The KSL's support for Excel files is not meant to replace the functionality of the POI library. Instead, the purpose is to provide a simple facade so that users can do some basic processing without worrying about the complexities of a full-featured library such as POI. Figure D.4 illustrates the functions and properties of the ExcelUtil object.

Figure D.4: ExcelUtil Class

The KSL class ExcelUtil provides the following functions, which are primarily concerned with reading and writing tabular data:

  • createSheet(workbook: Workbook, sheetName: String) Creates a sheet with the given name within the workbook.
  • writeSheetToCSV() Treats the columns of the sheet as fields in a CSV file and writes each row as a separate row in the resulting CSV file.
  • writeCell() Writes the object to the Excel cell.
  • readCellAsObject() : Any? Returns the data in the cell in the form of an object.
  • readCellAsString() : String Returns the data in the cell in the form of a String.
  • readRowAsObjectList() : List<Any?> Returns a list of objects representing the contents of the cells in the row.
  • readRowAsStringArray() : Array<String?> Returns an array of strings representing the contents of the cells in the row.
  • readRowAsStringList() : List<String?> Returns a list of strings representing the contents of the cells in the row.
  • readSheetAsObjects() : List<List<Any?>> Returns a list of lists of objects representing each cell of each row of the sheet.
  • columnSize() Returns the number of rows that have data in a particular column, as defined by not having a null cell.

To use the ExcelUtil object, it is useful to know how the Apache POI library works because you need to know how to create workbooks. The functionality of the POI library is also exposed as part of the KSL API. The main use of the ExcelUtil object within the KSL is for importing and exporting database tables and record sets to and from Excel.
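
As a minimal sketch, the following creates an Apache POI workbook, adds a sheet to it with ExcelUtil.createSheet(), and writes a couple of cells using plain POI calls. Only createSheet() comes from the list above; because its return value is not assumed here, the sheet is retrieved from the workbook by name. The file name demo.xlsx is arbitrary.

    import org.apache.poi.xssf.usermodel.XSSFWorkbook
    import java.io.FileOutputStream

    // a sketch: create a workbook with POI, add a sheet via ExcelUtil, write two cells
    val workbook = XSSFWorkbook()
    ExcelUtil.createSheet(workbook, "Data")
    // retrieve the sheet by name rather than relying on createSheet()'s return value
    val sheet = workbook.getSheet("Data")
    val row = sheet.createRow(0)
    row.createCell(0).setCellValue("x")
    row.createCell(1).setCellValue(42.0)
    // save the workbook to the KSL Excel directory
    FileOutputStream(KSL.excelDir.resolve("demo.xlsx").toFile()).use { out ->
        workbook.write(out)
    }
    workbook.close()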

Comma separated value (CSV) files and Excel files are commonly used to store data in a tabular (row/column) format. The KSL also provides some basic functionality for reading and writing tabular files.

A common programming necessity is to write or read data that is organized in rows and columns. While libraries that support CSV file processing are readily available, it is sometimes useful to have a quick and easy way to write (or read) tabular data that is not as sophisticated as a database. While Excel files also provide this ability, using the Apache POI library may be more than is required. To make this type of file processing easier, the KSL provides a general interface for representing a row of tabular data as a data class via the TabularData class. In addition, the TabularFile, TabularOutputFile, and TabularInputFile classes permit simple row-oriented reading and writing of tabular data.

Figure D.5: TabularData Class

Figure D.5 illustrates the functions and properties of the TabularData class. The basic usage of this class is to subclass a data class from it to define the contents of a row of data. This can be used for database processing as well as for working with the tabular file classes.

Figure D.6: Tabular File Classes

Figure D.6 illustrates the functions and properties of the classes for working with tabular data stored in files.

The following code illustrates how to make a tabular file with 5 columns. The first three columns hold numeric data, the fourth column holds text, and the fifth column holds numeric data. The enum DataType defines the two types of data that can be held in columns: numeric or text. This covers the vast majority of simple computing needs.

    val path: Path = KSL.outDir.resolve("demoFile")
    // configure the columns
    val columns: MutableMap<String, DataType> = TabularFile.columns(3, DataType.NUMERIC).toMutableMap()
    columns["c4"] = DataType.TEXT
    columns["c5"] = DataType.NUMERIC
    // make the file
    val tif = TabularOutputFile(columns, path)
    println(tif)

The columns() function is used to define the columns as a map that associates column names with data types. This map describes the type of each column within the tabular file.

The numeric type should be used for numeric data (float, double, long, int, etc.). In addition, use the numeric type for boolean values, which are stored as 1.0 = true and 0.0 = false. The text type should be used for strings and date/time data. Date and time data is saved as ISO8601 strings (“YYYY-MM-DD HH:MM:SS.SSS”). If you need more type complexity, you should use a database or some other more advanced serialization method such as JSON, Parquet, or Arrow.
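
To follow those conventions, boolean and date/time values can be converted before they are placed in a row. The helper functions below are illustrative only (they are not part of the KSL); they simply map a Boolean to 1.0/0.0 and format a LocalDateTime using the pattern stated above.

    import java.time.LocalDateTime
    import java.time.format.DateTimeFormatter

    // illustrative helpers (not KSL functions) that follow the storage conventions above
    fun Boolean.asNumericColumn(): Double = if (this) 1.0 else 0.0

    fun LocalDateTime.asTextColumn(): String =
        this.format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS"))

    // usage: booleans become 1.0/0.0, date/time values become formatted strings
    println(true.asNumericColumn())               // 1.0
    println(LocalDateTime.now().asTextColumn())   // e.g. 2024-03-01 14:05:30.123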

The following code creates some random data and writes rows of data to the file. The RowSetterIfc interface is used because it allows the numeric columns and the text columns to be written as separate row operations.

    // needed for some random data
    val n = NormalRV(10.0, 1.0)
    val k = 15
    // get a row
    val row: RowSetterIfc = tif.row()
    // write some data to each row
    println("Writing rows...")
    for (i in 1..k) {
        // reuse the same row, many times
        // can fill all numeric columns
        row.setNumeric(n.sample(5))
        // can set specific columns
        row.setText(3, "text data $i")
        // need to write the row to the buffer
        tif.writeRow(row)
    }
    // don't forget to flush the buffer
    tif.flushRows()
    println("Done writing rows!")

The tabular file package also allows for the definition of the data schema via a data class. The types of the properties of the data class are translated to either text or numeric columns. The following code illustrates this functionality. Here, a data class is used to define a text field and a numeric field, which are then written to the file.

    val path: Path = KSL.outDir.resolve("TabularDataFile")
    data class SomeData(var someText: String = "", var someData: Double = 0.0): TabularData("SomeData")
    val rowData = SomeData()
    // use the data class instance to define the columns and their types
    val tof = TabularOutputFile(rowData, path)
    println(tof)

    // needed for some random data
    val n = NormalRV(10.0, 1.0)
    val k = 15
    // write some data to each row
    println("Writing rows...")
    for (i in 1..k) {
        // reuse the same row, many times
        // can fill all numeric columns
        rowData.someData = n.value
        rowData.someText = "text data $i"
        // need to write the row to the buffer
        tof.writeRow(rowData)
    }
    // don't forget to flush the buffer
    tof.flushRows()
    println("Done writing rows!")

As long as the data class is compatible with the rows of the file, the same approach can be used to read in the data, as illustrated in the next code snippet.

    val tif = TabularInputFile(path)
    println(tif)

    // TabularInputFile is Iterable and foreach construct works across rows
    println("Printing all rows from an iterator using data class")
    for (row in tif.iterator()) {
        rowData.setPropertyValues(row)
        println(rowData)
    }
    println()

Reading data from the file provides a bit more functionality through some useful iterators. The following code opens a tabular file and iterates through its rows.

    val path: Path = KSL.outDir.resolve("demoFile")
    val tif = TabularInputFile(path)
    println(tif)

    // TabularInputFile is Iterable and foreach construct works across rows
    println("Printing all rows from an iterator")
    for (row in tif.iterator()) {
        println(row)
    }
    println()

You can fetch specific rows or a subset of rows. Multiple iterators can be active at the same time.

    // You can fetch rows as a list
    println("Printing a subset of rows")
    val rows: List<RowGetterIfc> = tif.fetchRows(1, 5)
    for (row in rows) {
        println(row)
    }
    println()

You can start the iterator at a particular row.

    println("Print starting at row 9")
    val iterator: TabularInputFile.RowIterator = tif.iterator(9)
    while (iterator.hasNext()) {
        println(iterator.next())
    }
    println()

You can grab various columns as arrays.

    println("Printing column 0")
    val numericColumn: DoubleArray = tif.fetchNumericColumn(0, 10, true)
    for (v in numericColumn) {
        println(v)
    }

You can write the data to an Excel workbook.

    try {
        tif.exportToExcelWorkbook("demoData.xlsx", KSL.excelDir)
    } catch (e: IOException) {
        e.printStackTrace()
    }

You can pretty print rows of the data and export the data to a CSV file.

    tif.printAsText(1, 5)
    val printWriter: PrintWriter = KSL.createPrintWriter("data.csv")
    tif.exportToCSV(printWriter, true)
    printWriter.close()

You can even convert the tabular data to an SQLite database or turn the data into a data frame.

    try {
        val database: DatabaseIfc = tif.asDatabase()
    } catch (e: IOException) {
        e.printStackTrace()
    }
    println()
    println("Printing a data frame version")
    val df = tif.asDataFrame()
    println(df)

Much of this functionality is used within the KSL database implementation, which will be discussed in a subsequent section. Since we just mentioned data frames, we discuss that functionality in the next section.