Frictionless Data, Frictionless Development
Andrew Stretton | Friday 10:30 | Room C
A common problem in data engineering is how to build a platform that can both import and export tabular data in numerous formats and maintain a change history of that data while users update and query it.
Tools like Trifacta (Google Cloud Dataprep [1]) provide a turnkey solution to part of the pipeline, but the open-source Frictionless Data [2] tools from OKFN can provide a simpler subset of these features tailored to your requirements.
Just as Pandas [3] is built around the DataFrame, the Frictionless Data approach is built around data packages [4], each consisting of a JSON table schema and a data URI. These schemata can be generated easily for any dataset and work well for a number of applications, such as:
- Validating new data with tools like Goodtables [5] or tableschema-py
- Building a data update interface with tools such as Handsontable [6]
- Creating declarative data processing pipelines that a front end can easily interact with, via datapackage-pipelines [7] and Kubernetes [8]
- Pushing data into various databases and repository tools such as the CKAN DataStore [9]
- Extending the schema to allow export to linked data formats such as IIIF
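To make the schema-plus-data idea concrete, the following is a minimal sketch that uses only the Python standard library. It infers a Table Schema style description from a small CSV, wraps it in a data package style descriptor, and checks new rows against it. The sample data, the package and resource names, and the helpers `infer_type`, `infer_schema` and `validate` are all hypothetical and written for illustration; they are not the API of Goodtables or tableschema-py, which handle many more types and constraints.

```python
import csv
import io
import json

# Hypothetical sample data; a real package would point "path" at a file or URL.
CSV_TEXT = "id,name,price\n1,apple,0.50\n2,pear,0.75\n"

# Casting functions for a small subset of Table Schema field types.
CASTERS = {"integer": int, "number": float, "string": str}


def infer_type(values):
    """Return the narrowest of integer/number/string that accepts every value."""
    for type_name in ("integer", "number"):
        try:
            for value in values:
                CASTERS[type_name](value)
            return type_name
        except ValueError:
            continue
    return "string"


def infer_schema(csv_text):
    """Build a Table Schema style dict from the CSV header and column values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    fields = [
        {"name": name, "type": infer_type([row[name] for row in rows])}
        for name in reader.fieldnames
    ]
    return {"fields": fields}


def validate(csv_text, schema):
    """Return (line_number, field_name, value) for every cell that fails to cast."""
    errors = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        for field in schema["fields"]:
            value = row.get(field["name"])
            if value is None:  # column missing from this row
                errors.append((line_number, field["name"], value))
                continue
            try:
                CASTERS[field["type"]](value)
            except ValueError:
                errors.append((line_number, field["name"], value))
    return errors


schema = infer_schema(CSV_TEXT)

# Descriptor in the shape of a data package: resources with a path and a schema.
descriptor = {
    "name": "fruit-prices",  # hypothetical package name
    "resources": [{"name": "prices", "path": "prices.csv", "schema": schema}],
}

print(json.dumps(descriptor, indent=2))
print(validate(CSV_TEXT, schema))                          # no errors expected
print(validate("id,name,price\nx,kiwi,1.20\n", schema))    # "x" is not an integer
```

Because the descriptor is plain JSON, the same document could be handed to a validation step, a grid editor in the browser, or a pipeline runner without any of them needing to know how the others are implemented.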
The talk will cover these use cases and compare them with the approaches taken by other open-source data science / BI tools, such as Datashape [10] with Odo [11] from Continuum and Superset [12] from Airbnb. I will aim to demonstrate that lightweight web standards like data packages speed up the development process.
References
[1] https://cloud.google.com/dataprep/
[2] http://frictionlessdata.io/tools/
[3] http://pandas.pydata.org/
[4] http://frictionlessdata.io/data-packages/
[5] http://goodtables.okfnlabs.org/
[6] https://github.com/handsontable/handsontable
[7] https://github.com/frictionlessdata/datapackage-pipelines
[8] https://kubernetes.io/
[9] https://github.com/ckan/ckan/tree/master/ckanext/datastore
[10] https://github.com/blaze/datashape
[11] https://github.com/blaze/odo
[12] https://github.com/apache/incubator-superset