As stated previously, a project management system architecture that supports associative management of the structured and unstructured data gathered for infrastructure projects must consist of three main parts: (1) a data harvesting subsystem, (2) a data storage subsystem, and (3) a data analysis subsystem. There are many possible approaches to implementing these three subsystems and the interfaces between them. From our evaluation of present and future requirements for managing infrastructure projects with a gig workforce, we have chosen a system architecture and related technologies to support the “data system” for the platform. The following provides a high-level description of the primary subsystem components that support the platform’s data system.
Key system technologies and structures targeted to support managing data and information include:
Open Data Format (ODF – http://www.opengroup.org/iot/odf/ ) – The ODF is specified using XML Schema. It defines a simple and extensible ontology for data storage that allows the creation of information structures similar to objects and properties in object-oriented programming. It is thereby generic enough to represent any object and any information needed for information exchange in domains such as physical infrastructure.
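To make the object-and-property idea concrete, the following sketch builds a small ODF-style hierarchy with Python's standard XML library. The element names (Objects, Object, id, InfoItem, value) follow the ODF convention of nesting objects and their named properties; the bridge object and its field values are hypothetical, chosen only to illustrate the structure.

```python
# Illustrative sketch: an ODF-style object hierarchy built with Python's
# standard xml.etree library. The bridge example is hypothetical.
import xml.etree.ElementTree as ET

objects = ET.Element("Objects")
bridge = ET.SubElement(objects, "Object")
ET.SubElement(bridge, "id").text = "Bridge-042"

# An InfoItem attaches a named property (with a value) to an object,
# much like a field on an object in object-oriented programming.
item = ET.SubElement(bridge, "InfoItem", name="DeckCondition")
ET.SubElement(item, "value").text = "fair"

xml_text = ET.tostring(objects, encoding="unicode")
print(xml_text)
```

Because the structure is ordinary XML, it can be validated against the ODF XML Schema and exchanged between systems without a custom parser.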
Structured Query Language Relational Database (SQL – https://en.wikipedia.org/wiki/SQL ) – Primarily used for management, transaction processing, queries, and analysis of structured data with tabular relations. Most of the data gathered while completing a workflow document or form will be stored in a relational database. Companies such as IBM, Oracle, Microsoft, SAP, and Teradata, among many others, offer SQL database environments. The best database software to use is often determined by a company’s existing data and data systems, so we will not recommend a specific SQL database system at this time.
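As a minimal sketch of how workflow-form data lands in tabular relations, the example below uses sqlite3 (Python's built-in SQL engine) as a stand-in for whichever vendor database an organization already runs. The table schema and field names are hypothetical.

```python
# Sketch: storing structured workflow-form responses in a relational
# table. sqlite3 stands in for any vendor SQL database; the schema
# and the sample project/field names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE form_responses (
        response_id INTEGER PRIMARY KEY,
        project_id  TEXT NOT NULL,
        field_name  TEXT NOT NULL,
        field_value TEXT
    )
""")
conn.executemany(
    "INSERT INTO form_responses (project_id, field_name, field_value) "
    "VALUES (?, ?, ?)",
    [
        ("BR-042", "inspector", "J. Smith"),
        ("BR-042", "deck_condition", "fair"),
    ],
)

# A typical structured query: all recorded form fields for one project.
rows = conn.execute(
    "SELECT field_name, field_value FROM form_responses "
    "WHERE project_id = ?",
    ("BR-042",),
).fetchall()
print(rows)
```

The same schema and queries would carry over with minor dialect changes to any of the vendor SQL environments named above, which is why the architecture can defer the specific product choice.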
NoSQL (originally referring to “non-SQL” – https://en.wikipedia.org/wiki/NoSQL ) – A class of databases that provides a mechanism for storing and retrieving data modeled in forms other than the tabular relations used in relational (SQL) databases. Common NoSQL data structures include key–value pair, wide column, graph, and document stores. We propose a “document” store for unstructured data, given its ability to integrate well with SQL databases and support associative queries. A document store assumes that documents encapsulate and encode data (or information) in some standard format or encoding. Each document is addressed in the database via a unique key. Another defining characteristic of a document-oriented database is an API or query language for retrieving documents based on their contents.
An example of a viable NoSQL database is MongoDB (https://www.mongodb.com/), a cross-platform document-oriented NoSQL database program. MongoDB uses JSON-like documents with optional schemas. It provides high performance, high availability, and easy scalability. MongoDB is developed by MongoDB Inc. and licensed under the Server Side Public License (SSPL).
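The two defining traits just described, addressing documents by a unique key and querying them by their contents, can be sketched with a minimal in-memory store. This is plain Python, not MongoDB's actual API; the documents and keys are hypothetical.

```python
# Minimal in-memory sketch of a document store (not MongoDB's real API):
# documents are addressed by a unique key, and a simple query function
# retrieves documents by their contents.
store = {}

def put(key, document):
    store[key] = document

def find(**criteria):
    """Return documents whose fields match all of the given criteria."""
    return [
        doc for doc in store.values()
        if all(doc.get(field) == value for field, value in criteria.items())
    ]

# Unstructured field-report documents; note the schemas may differ
# from document to document.
put("doc-001", {"project": "BR-042", "type": "photo_log", "count": 14})
put("doc-002", {"project": "BR-042", "type": "field_note",
                "text": "Spalling observed on pier 3."})

print(store["doc-001"])          # lookup by unique key
print(find(type="field_note"))   # lookup by contents
```

MongoDB generalizes exactly this pattern: its JSON-like documents need not share a schema, and its query language filters collections by document contents.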
Hadoop (https://en.wikipedia.org/wiki/Apache_Hadoop) – An open-source big data framework from the Apache Software Foundation designed to handle huge amounts of data on clusters of servers. Hadoop is a general-purpose distributed processing platform comprising a distributed file system (HDFS), a scheduler that coordinates application runtimes (YARN), and a processing engine that operates on the data in parallel (e.g., MapReduce or Spark).
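The MapReduce pattern that Hadoop parallelizes can be sketched in plain Python using the canonical word-count example: a map phase emits (key, value) pairs, a shuffle groups them by key, and a reduce phase aggregates each group. On a real cluster these phases run in parallel over data blocks stored in HDFS; the sample records here are hypothetical.

```python
# Plain-Python sketch of the MapReduce pattern (word count). On a real
# Hadoop cluster, map and reduce tasks run in parallel across nodes.
from collections import defaultdict

records = ["bridge deck fair", "bridge pier poor", "deck fair"]

# Map: each input record emits (word, 1) pairs.
mapped = [(word, 1) for record in records for word in record.split()]

# Shuffle: group the emitted values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: aggregate each group -- here, sum the 1s into word counts.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)
```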
Apache Spark (https://en.wikipedia.org/wiki/Apache_Spark) – A data processing framework that can quickly perform processing tasks (e.g., machine learning) on very large data sets and can distribute those tasks across multiple computers, either on its own or in tandem with other distributed computing tools. Spark has become the framework of choice for processing big data, overtaking the older MapReduce paradigm that brought Hadoop to prominence. The principal advantages of the Spark framework are its speed and its high-level APIs.
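Part of the API advantage is Spark's chained-transformation style, which the sketch below mimics in plain Python (this is not PySpark itself, and the sensor readings are hypothetical). In Spark, the equivalent filter/map/reduce chain would execute in parallel across partitions of a large data set.

```python
# Sketch of the chained-transformation style that makes Spark's API
# attractive, mimicked with Python built-ins (not PySpark itself).
from functools import reduce

readings = [3.2, 7.8, 5.1, 9.4, 2.0]  # hypothetical sensor values

# Filter out low readings, scale the rest, then reduce to a total --
# the same shape as rdd.filter(...).map(...).reduce(...) in Spark,
# where each stage would run in parallel across data partitions.
total = reduce(
    lambda a, b: a + b,
    map(lambda x: x * 10, filter(lambda x: x > 3.0, readings)),
)
print(total)
```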
Our primary goal for this system architecture is to provide the functionality, flexibility, and scale needed for effective use of, and association across, data types in support of value creation in an infrastructure-focused project management system. We believe the proposed structure meets these requirements.
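The association across data types can be sketched as a shared project key linking structured rows on the relational side to unstructured documents on the document side. Both stores below are in-memory stand-ins (sqlite3 and a plain dict), and the keys and field names are hypothetical.

```python
# Sketch: a shared project key associates a structured record (SQL side)
# with unstructured documents (document-store side). Both stores are
# in-memory stand-ins; keys and fields are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE projects (project_id TEXT PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO projects VALUES ('BR-042', 'in_progress')")

documents = {
    "doc-001": {"project_id": "BR-042", "note": "Spalling on pier 3."},
    "doc-002": {"project_id": "RD-007", "note": "Pothole cluster at mile 12."},
}

def project_view(project_id):
    """Join a structured record with its unstructured documents."""
    row = conn.execute(
        "SELECT status FROM projects WHERE project_id = ?", (project_id,)
    ).fetchone()
    docs = [d for d in documents.values() if d["project_id"] == project_id]
    return {"project_id": project_id, "status": row[0], "documents": docs}

print(project_view("BR-042"))
```

In the full platform, the same shared-key pattern would span the SQL database and the document store, giving analysis tools one associative view over both structured form data and unstructured field documents.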