SOFTWARE DEVELOPMENT FOR DATA SCIENCE

SHE Level 5
SCQF Credit Points 20.00
ECTS Credit Points 10.00
Module Code MMI226556
Module Leader Gordon Morison
School School of Computing, Engineering and Built Environment
Subject Computing
Trimesters
  • A (September start)
  • C (May start)

Summary of Content

This module will focus on the development of software programming skills used for data science using an appropriate programming language such as Python or R. Implementation methods for the foundation topics in data science such as data capture, wrangling, analysis, processing, visualisation and reproducible reports will be covered. The student will gain an understanding of the various data science software ecosystems in order to apply statistical data analysis techniques (descriptive and inferential), machine learning and information visualisation techniques. This will be introduced via practical examples using both data simulation and real-world datasets to allow the student to make decisions that are supported by data.

Syllabus

  • Data Science Ecosystem: Introduction to the software development libraries available for data science vectors, matrices and data representations.
  • Fundamentals of Data Science Software Development: Sequence, selection and iteration in a suitable programming language for data science.
  • Data Preparation: Data collection, pre-processing, cleaning, wrangling, transformation and integration.
  • Exploratory Data Analysis: Basic statistics: population vs sample, mean, median, mode, standard deviation, skewness, variance, correlation, covariance.
  • Hypothesis Testing: Statistical distributions, standard error and confidence interval, type 1 and 2 errors, p-value, Bayes factor, test for the mean, comparing two means (dependent and independent samples).
  • Linear Regression: Simple linear regression, multiple regression, forecasting, classification and clustering.
  • Data Visualisation: Visualisation of large and small data sets using open source software libraries.

Learning Outcomes

On successful completion of this module a student should be able to:
1. Demonstrate a detailed understanding of the available tools within the data science software ecosystem
2. Demonstrate a detailed understanding of software development processes for data science
3. Analyse and evaluate data analysis modelling methods
4. Develop software applications to conduct data analysis and visualisation
5. Critically interpret and evaluate the outputs generated by analysis techniques

Teaching / Learning Strategy

The learning and teaching strategy for this module has been informed by the university's 'Strategy for Learning' design principles. The course material is introduced through lectures and laboratory/tutorial sessions that draw upon and extend the lecture material to deepen students' knowledge. The laboratory/tutorial sessions are designed as a set of formative exercises and a substantial summative exercise spanning several weeks. The formative exercises introduce a range of methodologies that allow students to gain confidence and build knowledge of the range of solutions that can be applied to particular problems. Summative exercises provide experience in real-world problem-solving and challenge students to demonstrate analytical, design & implementation skills and a capacity for divergent thinking. Tutorials (held in laboratory space) will be used to help explain and elaborate on both the lecture material and the laboratory exercises; these will include a range of case studies and live implementation examples that bring a global perspective to the subject matter. During all lab and tutorial sessions students receive formative feedback on their performance. Summative feedback and grades are also provided for the coursework assignments undertaken as part of the module, using GCULearn. GCULearn is also used to provide the students with module-specific Forums and Wikis to stimulate student and lecturer interaction outside of the normal lecture, laboratory and tutorial sessions. Flexible learning is encouraged and supported. All teaching materials and self-testing exercises are made available on GCULearn and links are provided to external materials such as podcasts, MOOCs, videos and relevant literature.
All the computing resources used for laboratories are made available either by virtual machine images or low-resource-utilisation applications (made available to students for use on their own computers) or online using industry standard cloud computing services provided by major global computing industry vendors. Due to the availability of all material and computing facilities online, the module is suitable for use where Flexible and Distributed Learning (FDL) is required. Students can access complete laboratory facilities in their own time.

Indicative Reading

Python
  • Jake VanderPlas, Python Data Science Handbook: Essential Tools for Working with Data (2016), O'Reilly Media Inc.
  • Wes McKinney, Python for Data Analysis, 2nd Edition (2017), O'Reilly Media Inc.
R
  • Hadley Wickham, Garrett Grolemund, R for Data Science: Import, Tidy, Transform, Visualize, and Model Data (2017), O'Reilly Media Inc.
  • Radhika Datar, Harish Garg, Hands-On Exploratory Data Analysis with R (2019), Packt Publishing

Transferable Skills

  • D1 Specialist knowledge and application
  • D2 Critical thinking and problem solving
  • D3 Critical analysis
  • D4 Communication skills, written, oral and listening
  • D5 Numeracy
  • D6 Effective information retrieval and research skills
  • D7 Computer literacy
  • D8 Self-confidence, self-discipline & self-reliance (independent working)
  • D10 Creativity, innovation & independent thinking
  • D14 Ability to prioritise tasks and time management
  • D16 Presentation skills

Module Structure

Activity Total Hours
Tutorials (FT) 12.00
Lectures (FT) 12.00
Independent Learning (FT) 120.00
Assessment (FT) 20.00
Practicals (FT) 36.00

Assessment Methods

Component Duration Weighting Threshold Description
Course Work 02 n/a 50.00 45% Practical coursework exercise
Course Work 01 n/a 50.00 45% Practical coursework exercise