JJ Nicholas: October 2021

Monday, 11 October 2021

Data Science with Python Simulation Test 1

1. What is the rank of the numpy array? array([[ 0, 4, 2], [ 9, 3, 7]])

SELECT THE CORRECT ANSWER: Rank 2
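A quick way to check this is sketched below. It assumes "rank" is used in the older NumPy sense of "number of dimensions", which is what the `ndim` attribute reports.

```python
import numpy as np

# The array from the question: 2 rows, 3 columns.
a = np.array([[0, 4, 2], [9, 3, 7]])

n_axes = a.ndim    # number of axes (dimensions) of the array
shape = a.shape    # length of each axis

print(n_axes)  # 2
print(shape)   # (2, 3)
```

Note that this is not the same idea as the linear-algebra rank of a matrix, which `np.linalg.matrix_rank` computes.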

2. Choose the correct output of the following program:

>>> a = np.array([11, 12, 13, 14])
>>> b = np.array([1, 2, 3, 4])
>>> c = a - b
>>> c

SELECT THE CORRECT ANSWER: array([10, 10, 10, 10])
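The subtraction is element-wise, which a short script can confirm. This is only a sketch of the program in the question, run outside the interactive prompt.

```python
import numpy as np

a = np.array([11, 12, 13, 14])
b = np.array([1, 2, 3, 4])

# Arrays of the same shape are subtracted element by element.
c = a - b

print(repr(c))  # array([10, 10, 10, 10])
```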

3. Which of the following data structures of Pandas can handle 3D data?

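SELECT THE CORRECT ANSWER: Panel (deprecated in pandas 0.20 and removed in 0.25; current versions typically use a DataFrame with a MultiIndex instead)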

4. To combine datasets, the ____ function of Pandas can be utilized.

SELECT THE CORRECT ANSWER: Concat
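As a small illustration, the sketch below stacks two DataFrames with `pd.concat`. The frame names and values are made up for the example.

```python
import pandas as pd

# Two hypothetical datasets with the same columns.
first = pd.DataFrame({"store": [1, 2], "sales": [100, 150]})
second = pd.DataFrame({"store": [1, 2], "sales": [120, 130]})

# Stack the rows; ignore_index=True renumbers the result 0..n-1.
combined = pd.concat([first, second], ignore_index=True)

print(combined)
```

Passing `axis=1` instead would place the frames side by side rather than one under the other.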

5. What is the output of a and b? Given: a = 9/2 b = 5.2/2

SELECT THE CORRECT ANSWER: a = 4.5, b = 2.6
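The answer assumes Python 3, where `/` is always true division. A minimal check, with floor division shown for contrast:

```python
# In Python 3 the / operator returns a float even for two integers.
a = 9 / 2
b = 5.2 / 2

# Floor division (//) discards the fractional part instead.
floor_a = 9 // 2

print(a, b, floor_a)  # 4.5 2.6 4
```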

6. A list is a collection of values that may be of multiple data types, and it supports the following operations:

SELECT THE CORRECT ANSWER: add, update, remove
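The three operations can be sketched on a small mixed-type list. The values are arbitrary.

```python
# A list can hold values of different types.
items = [1, "two", 3.0]

items.append("four")   # add a value at the end
items[1] = "TWO"       # update the value at index 1
items.remove(3.0)      # remove the first value equal to 3.0

print(items)  # [1, 'TWO', 'four']
```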

Project 4 -- Retail Analysis with Walmart Data

Retail Analysis with Walmart Data

DESCRIPTION

One of the leading retail stores in the US, Walmart, would like to predict sales and demand accurately. Certain events and holidays affect sales on each day. Sales data are available for 45 Walmart stores. The business faces a challenge from unforeseen demand and sometimes runs out of stock because of an inappropriate machine learning algorithm. An ideal ML algorithm will predict demand accurately and take in factors such as economic conditions, including CPI, the Unemployment Index, etc.

Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labour Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modeling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data. Historical sales data for 45 Walmart stores located in different regions are available.

Dataset Description

This is the historical data that covers sales from 2010-02-05 to 2012-11-01, in the file Walmart_Store_sales. Within this file you will find the following fields:

Store - the store number

Date - the week of sales

Weekly_Sales - sales for the given store

Holiday_Flag - whether the week is a special holiday week (1 = holiday week, 0 = non-holiday week)

Temperature - Temperature on the day of sale

Fuel_Price - Cost of fuel in the region

CPI - Prevailing consumer price index

Unemployment - Prevailing unemployment rate

Holiday Events

Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13

Labour Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13

Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13

Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13

Analysis Tasks

Basic Statistics tasks

Which store has maximum sales

Which store has the maximum standard deviation, i.e., the store whose sales vary the most. Also, find the coefficient of variation (the ratio of standard deviation to mean)

Which store(s) had a good quarterly growth rate in Q3 2012

Some holidays have a negative impact on sales. Find out holidays which have higher sales than the mean sales in non-holiday season for all stores together

Provide a monthly and semester view of sales in units and give insights
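One possible starting point for the basic statistics tasks is sketched below with pandas. The rows are made-up toy numbers, not values from Walmart_Store_sales, and the day-month-year date format and the `Semester` column are assumptions made for the illustration.

```python
import pandas as pd

# Toy rows with invented numbers; the real file covers 45 stores.
df = pd.DataFrame({
    "Store": [1, 1, 1, 2, 2, 2],
    "Date": ["05-02-2010", "12-02-2010", "19-02-2010"] * 2,
    "Weekly_Sales": [100.0, 140.0, 120.0, 200.0, 210.0, 190.0],
    "Holiday_Flag": [0, 1, 0, 0, 1, 0],
})
# Assumption: dates are stored as day-month-year strings.
df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%Y")

by_store = df.groupby("Store")["Weekly_Sales"]

# Store with the highest total sales.
top_store = by_store.sum().idxmax()

# Store with the highest standard deviation, plus std / mean per store.
std = by_store.std()
most_variable_store = std.idxmax()
cv = std / by_store.mean()

# Holiday weeks whose mean sales exceed the non-holiday mean (all stores).
non_holiday_mean = df.loc[df["Holiday_Flag"] == 0, "Weekly_Sales"].mean()
holiday_means = df[df["Holiday_Flag"] == 1].groupby("Date")["Weekly_Sales"].mean()
strong_holidays = holiday_means[holiday_means > non_holiday_mean]

# Monthly and semester (half-year) totals.
monthly = df.groupby(df["Date"].dt.to_period("M"))["Weekly_Sales"].sum()
df["Semester"] = (df["Date"].dt.month > 6).astype(int) + 1
semester = df.groupby([df["Date"].dt.year.rename("Year"), "Semester"])["Weekly_Sales"].sum()

print(top_store, most_variable_store)
print(cv)
print(strong_holidays)
print(monthly)
print(semester)
```

The quarterly growth task is not covered here; it would need a definition of growth (for example, Q3 total relative to Q2 total) that the brief leaves open.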

Statistical Model

For Store 1 – Build prediction models to forecast demand

Linear Regression – Utilize variables like date and restructure dates as 1 for 5 Feb 2010 (starting from the earliest date in order). Hypothesize if CPI, unemployment, and fuel price have any impact on sales.

Change dates into days by creating a new variable.

Select the model which gives best accuracy.
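The date restructuring and a first linear fit could be sketched as follows. The sales figures are invented, the column names `Date_Index` and `Days` are hypothetical, and plain least squares from NumPy stands in for whichever regression library is actually used (scikit-learn's `LinearRegression` would be a common alternative).

```python
import numpy as np
import pandas as pd

# Invented weekly sales for one store, one row per week.
store1 = pd.DataFrame({
    "Date": pd.to_datetime(["2010-02-05", "2010-02-12", "2010-02-19", "2010-02-26"]),
    "Weekly_Sales": [10.0, 12.0, 14.0, 16.0],
}).sort_values("Date")

# 1 for the earliest date, 2 for the next one, and so on.
store1["Date_Index"] = np.arange(1, len(store1) + 1)

# Alternative encoding: days elapsed since the earliest date.
store1["Days"] = (store1["Date"] - store1["Date"].min()).dt.days

# Ordinary least squares: Weekly_Sales ~ intercept + slope * Date_Index.
X = np.column_stack([np.ones(len(store1)), store1["Date_Index"]])
y = store1["Weekly_Sales"].to_numpy()
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef

print(intercept, slope)
```

CPI, Unemployment, and Fuel_Price can be tested in the same way by appending them as extra columns of `X` and comparing the fit with and without them.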

Good Luck!!!

Project 3 -- Comcast Telecom Consumer Complaints

Comcast Telecom Consumer Complaints

DESCRIPTION

Comcast is an American global telecommunication company. The firm has been providing terrible customer service. They continue to fall short despite repeated promises to improve. Only last month (October 2016) the authority fined them $2.3 million, after receiving over 1000 consumer complaints.

The existing database will serve as a repository of public customer complaints filed against Comcast.

It will help to pin down what is wrong with Comcast's customer service.

Data Dictionary

Ticket #: Ticket number assigned to each complaint

Customer Complaint: Description of complaint

Date: Date of complaint

Time: Time of complaint

Received Via: Mode of communication of the complaint

City: Customer city

State: Customer state

Zipcode: Customer zip

Status: Status of complaint

Filing on behalf of someone

Analysis Task

To perform these tasks, you can use any of the different Python libraries such as NumPy, SciPy, Pandas, scikit-learn, matplotlib, and BeautifulSoup.

- Import data into Python environment.

- Provide the trend chart for the number of complaints at monthly and daily granularity levels.

- Provide a table with the frequency of complaint types.

Which complaint types are most frequent, i.e., around internet, network issues, or across any other domains.

- Create a new categorical variable with the values Open and Closed. Open and Pending are to be categorized as Open; Closed and Solved are to be categorized as Closed.

- Provide state wise status of complaints in a stacked bar chart. Use the categorized variable from Q3. Provide insights on:

Which state has the maximum complaints

Which state has the highest percentage of unresolved complaints

- Provide the percentage of complaints resolved till date, which were received through the Internet and customer care calls.

The analysis results are to be provided with insights wherever applicable.
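The status grouping and the state-wise breakdown might be approached as in the sketch below. The rows are made up, and `Status_Group` is a hypothetical name for the new categorical variable.

```python
import pandas as pd

# Made-up complaints using the State and Status fields from the data dictionary.
complaints = pd.DataFrame({
    "State": ["Georgia", "Georgia", "Florida", "Florida", "Florida"],
    "Status": ["Open", "Solved", "Pending", "Closed", "Closed"],
})

# Collapse the four raw statuses into two groups.
status_map = {"Open": "Open", "Pending": "Open", "Closed": "Closed", "Solved": "Closed"}
complaints["Status_Group"] = complaints["Status"].map(status_map)

# Rows = states, columns = Open / Closed counts.
state_status = pd.crosstab(complaints["State"], complaints["Status_Group"])

# State with the most complaints, and with the highest share still open.
most_complaints = state_status.sum(axis=1).idxmax()
unresolved_pct = state_status["Open"] / state_status.sum(axis=1) * 100
most_unresolved = unresolved_pct.idxmax()

print(state_status)
print(most_complaints, most_unresolved)

# With matplotlib installed, a stacked bar chart is one line:
# state_status.plot(kind="bar", stacked=True)
```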

Good Luck!!!

Project 2 -- Movielens Case Study

Movielens Case Study

DESCRIPTION

Background of Problem Statement : The GroupLens Research Project is a research group in the Department of Computer Science and Engineering at the University of Minnesota. Members of the GroupLens Research Project are involved in many research projects related to the fields of information filtering, collaborative filtering, and recommender systems. The project is led by professors John Riedl and Joseph Konstan. The project began to explore automated collaborative filtering in 1992 but is best known for its worldwide trial of an automated collaborative filtering system for Usenet news in 1996. Since then the project has expanded its scope to research overall information filtering solutions, integrating content-based methods as well as improving current collaborative filtering technology.

Problem Objective :

Here, we ask you to perform the analysis using the Exploratory Data Analysis technique. You need to find features affecting the ratings of any particular movie and build a model to predict the movie ratings.

Domain: Entertainment

Analysis Tasks to be performed:

Import the three datasets

Create a new dataset [Master_Data] with the following columns MovieID Title UserID Age Gender Occupation Rating. (Hint: (i) Merge two tables at a time. (ii) Merge the tables using two primary keys MovieID & UserId)

Explore the datasets using visual representations (graphs or tables), also include your comments on the following:

User Age Distribution

User rating of the movie “Toy Story”

Top 25 movies by viewership rating

Find the ratings for all the movies reviewed by a particular user, user id = 2696

Feature Engineering:

Use column genres:

Find out all the unique genres (Hint: split the data in column genre making a list and then process the data to find out only the unique categories of genres)

Create a separate column for each genre category with a one-hot encoding ( 1 and 0) whether or not the movie belongs to that genre.

Determine the features affecting the ratings of any particular movie.

Develop an appropriate model to predict the movie ratings
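The merge hint (two tables at a time, joined on MovieID and UserID) can be sketched with small stand-in frames. The user and rating rows below are invented for the illustration, and the variable name `master` is only a stand-in for Master_Data.

```python
import pandas as pd

# Small stand-ins for the three datasets; user and rating rows are invented.
movies = pd.DataFrame({"MovieID": [1, 2],
                       "Title": ["Toy Story (1995)", "Jumanji (1995)"]})
users = pd.DataFrame({"UserID": [10, 11], "Age": [25, 35],
                      "Gender": ["F", "M"], "Occupation": [12, 7]})
ratings = pd.DataFrame({"UserID": [10, 11, 10],
                        "MovieID": [1, 1, 2],
                        "Rating": [5, 4, 3]})

# Merge two tables at a time: ratings + movies, then the result + users.
master = ratings.merge(movies, on="MovieID").merge(users, on="UserID")
master = master[["MovieID", "Title", "UserID", "Age", "Gender", "Occupation", "Rating"]]

# Example exploration: all ratings given to "Toy Story".
toy_story = master[master["Title"].str.startswith("Toy Story")]

# Example exploration: all ratings by one user (the brief asks for user 2696).
one_user = master[master["UserID"] == 10]

print(master)
print(toy_story["Rating"].mean())
```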

Dataset Description :

These files contain 1,000,209 anonymous ratings of approximately 3,900 movies made by 6,040 MovieLens users who joined MovieLens in 2000.

Ratings.dat

Format - UserID::MovieID::Rating::Timestamp

Field Description

UserID Unique identification for each user

MovieID Unique identification for each movie

Rating User rating for each movie

Timestamp Timestamp generated while adding user review

UserIDs range between 1 and 6040

The MovieIDs range between 1 and 3952

Ratings are made on a 5-star scale (whole-star ratings only)

A timestamp is represented in seconds since the epoch, as returned by time(2)

Each user has at least 20 ratings
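A file in this `::`-separated format can be read with pandas as sketched below. The two sample lines are illustrative input fed through `io.StringIO` so that the example does not depend on the real file; `Rated_At` is a hypothetical column name.

```python
import io
import pandas as pd

# Two illustrative lines in the UserID::MovieID::Rating::Timestamp format.
sample = io.StringIO("1::1193::5::978300760\n1::661::3::978302109\n")

# A multi-character separator requires the Python parsing engine.
ratings = pd.read_csv(
    sample,
    sep="::",
    engine="python",
    names=["UserID", "MovieID", "Rating", "Timestamp"],
)

# Convert seconds since the epoch into datetimes (UTC, timezone-naive).
ratings["Rated_At"] = pd.to_datetime(ratings["Timestamp"], unit="s")

print(ratings)
```

For the real file, the `StringIO` object would be replaced by the path to Ratings.dat.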

Users.dat

Format - UserID::Gender::Age::Occupation::Zip-code

Field Description

UserID Unique identification for each user

Gender Gender of the user

Age User’s age

Occupation User’s Occupation

Zip-code Zip Code for the user’s location

All demographic information is provided voluntarily by the users and is not checked for accuracy. Only users who have provided demographic information are included in this data set.

Gender is denoted by an "M" for male and "F" for female

Age is chosen from the following ranges:

Value Description

1 "Under 18"

18 "18-24"

25 "25-34"

35 "35-44"

45 "45-49"

50 "50-55"

56 "56+"

Occupation is chosen from the following choices:

Value Description

0 "other" or not specified

1 "academic/educator"

2 "artist"

3 "clerical/admin"

4 "college/grad student"

5 "customer service"

6 "doctor/health care"

7 "executive/managerial"

8 "farmer"

9 "homemaker"

10 "K-12 student"

11 "lawyer"

12 "programmer"

13 "retired"

14 "sales/marketing"

15 "scientist"

16 "self-employed"

17 "technician/engineer"

18 "tradesman/craftsman"

19 "unemployed"

20 "writer"

Movies.dat

Format - MovieID::Title::Genres

Field Description

MovieID Unique identification for each movie

Title A title for each movie

Genres Category of each movie

Titles are identical to titles provided by the IMDB (including year of release)

Genres are pipe-separated and are selected from the following genres:

Action

Adventure

Animation

Children's

Comedy

Crime

Documentary

Drama

Fantasy

Film-Noir

Horror

Musical

Mystery

Romance

Sci-Fi

Thriller

War

Western

Some MovieIDs do not correspond to a movie due to accidental duplicate entries and/or test entries

Movies are mostly entered by hand, so errors and inconsistencies may exist.
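Because genres are pipe-separated, the unique-genre list and the one-hot columns asked for in the feature engineering tasks can both come from a split on `|`. The sketch below uses three sample rows in the Movies.dat layout; `str.get_dummies` is one of several ways to build the indicator columns.

```python
import pandas as pd

# Three sample rows in the MovieID::Title::Genres layout.
movies = pd.DataFrame({
    "MovieID": [1, 2, 3],
    "Title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "Genres": ["Animation|Children's|Comedy",
               "Adventure|Children's|Fantasy",
               "Comedy|Romance"],
})

# Unique genres: split every row on "|" and collect the pieces in a set.
unique_genres = sorted({g for row in movies["Genres"] for g in row.split("|")})

# One column per genre, 1 if the movie belongs to it and 0 otherwise.
genre_flags = movies["Genres"].str.get_dummies(sep="|")
movies_encoded = pd.concat([movies, genre_flags], axis=1)

print(unique_genres)
print(movies_encoded)
```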

Good Luck!!!

Project: Customer Service Requests Analysis

Customer Service Requests Analysis

DESCRIPTION

Background of Problem Statement : NYC 311's mission is to provide the public with quick and easy access to all New York City government services and information while offering the best customer service. Each day, NYC311 receives thousands of requests related to several hundred types of non-emergency services, including noise complaints, plumbing issues, and illegally parked cars. These requests are received by NYC311 and forwarded to the relevant agencies such as the police, buildings, or transportation. The agency responds to the request, addresses it, and then closes it.

Problem Objective :

Perform a service request data analysis of New York City 311 calls. You will focus on the data wrangling techniques to understand the pattern in the data and also visualize the major complaint types.

Domain: Customer Service

Analysis Tasks to be performed:

(Perform a service request data analysis of New York City 311 calls)

Import a 311 NYC service request.

Read or convert the columns ‘Created Date’ and ‘Closed Date’ to datetime datatype and create a new column ‘Request_Closing_Time’ as the time elapsed between request creation and request closing. (Hint: Explore the package/module datetime)

Provide major insights/patterns that you can offer in a visual format (graphs or tables); at least 4 major conclusions that you can come up with after generic data mining.

Order the complaint types based on the average ‘Request_Closing_Time’, grouping them for different locations.

Perform a statistical test for the following:

Please note: For the statements below, you need to state the null and alternative hypotheses and then provide a statistical test to accept or reject the null hypothesis, along with the corresponding ‘p-value’.

Whether the average response time across complaint types is similar or not (overall)

Are the type of complaint or service requested and location related?
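The date conversion, the per-location ordering, and the two tests might be sketched as follows. The rows are invented, the month/day/year 12-hour date format is an assumption about the file, City is used as the location field, and one-way ANOVA and the chi-square test of independence (both from SciPy) are one reasonable choice of tests rather than the required ones.

```python
import pandas as pd
from scipy import stats

# Invented requests; the date format is an assumption about the real file.
requests = pd.DataFrame({
    "Created Date": ["01/01/2015 10:00:00 AM", "01/01/2015 11:00:00 AM",
                     "01/02/2015 09:00:00 AM", "01/02/2015 08:00:00 AM",
                     "01/03/2015 08:00:00 AM", "01/03/2015 10:00:00 AM"],
    "Closed Date": ["01/01/2015 12:00:00 PM", "01/01/2015 02:00:00 PM",
                    "01/02/2015 01:00:00 PM", "01/02/2015 02:00:00 PM",
                    "01/03/2015 03:00:00 PM", "01/03/2015 06:00:00 PM"],
    "Complaint Type": ["Noise", "Noise", "Noise",
                       "Illegal Parking", "Illegal Parking", "Illegal Parking"],
    "City": ["BROOKLYN", "QUEENS", "BROOKLYN", "QUEENS", "BROOKLYN", "QUEENS"],
})

for col in ("Created Date", "Closed Date"):
    requests[col] = pd.to_datetime(requests[col], format="%m/%d/%Y %I:%M:%S %p")

# Elapsed time between creation and closing, also expressed in hours.
requests["Request_Closing_Time"] = requests["Closed Date"] - requests["Created Date"]
requests["Hours"] = requests["Request_Closing_Time"].dt.total_seconds() / 3600

# Average closing time per complaint type, ordered within each city.
avg = (requests.groupby(["City", "Complaint Type"])["Hours"].mean()
       .reset_index().sort_values(["City", "Hours"]))

# H0: mean closing time is the same for every complaint type (one-way ANOVA).
groups = [g.to_numpy() for _, g in requests.groupby("Complaint Type")["Hours"]]
f_stat, p_anova = stats.f_oneway(*groups)

# H0: complaint type and city are independent (chi-square test).
table = pd.crosstab(requests["Complaint Type"], requests["City"])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

print(avg)
print(p_anova, p_chi2)
```

In both tests a p-value below the chosen significance level (0.05 is a common choice) would lead to rejecting the null hypothesis.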

Dataset Description :

Field Description

Unique Key (Plain text) - Unique identifier for the complaints

Created Date (Date and Time) - The date and time on which the complaint is raised

Closed Date (Date and Time) - The date and time on which the complaint is closed

Agency (Plain text) - Agency code

Agency Name (Plain text) - Name of the agency

Complaint Type (Plain text) - Type of the complaint

Descriptor (Plain text) - Complaint type label (Heating - Heat, Traffic Signal Condition - Controller)

Location Type (Plain text) - Type of the location (Residential, Restaurant, Bakery, etc)

Incident Zip (Plain text) - Zip code for the location

Incident Address (Plain text) - Address of the location

Street Name (Plain text) - Name of the street

Cross Street 1 (Plain text) - Detail of cross street

Cross Street 2 (Plain text) - Detail of another cross street

Intersection Street 1 (Plain text) - Detail of intersection street if any

Intersection Street 2 (Plain text) - Detail of another intersection street if any

Address Type (Plain text) - Categorical (Address or Intersection)

City (Plain text) - City for the location

Landmark (Plain text) - Empty field

Facility Type (Plain text) - N/A

Status (Plain text) - Categorical (Closed or Pending)

Due Date (Date and Time) - Date and time for the pending complaints

Resolution Action Updated Date (Date and Time) - Date and time when the resolution was provided

Community Board (Plain text) - Categorical field (specifies the community board with its code)

Borough (Plain text) - Categorical field (specifies the borough)

X Coordinate (State Plane) (Number)

Y Coordinate (State Plane) (Number)

Park Facility Name (Plain text) - Unspecified

Park Borough (Plain text) - Categorical (Unspecified, Queens, Brooklyn etc)

School Name (Plain text) - Unspecified

School Number (Plain text) - Unspecified

School Region (Plain text) - Unspecified

School Code (Plain text) - Unspecified

School Phone Number (Plain text) - Unspecified

School Address (Plain text) - Unspecified

School City (Plain text) - Unspecified

School State (Plain text) - Unspecified

School Zip (Plain text) - Unspecified

School Not Found (Plain text) - Empty Field

School or Citywide Complaint (Plain text) - Empty Field

Vehicle Type (Plain text) - Empty Field

Taxi Company Borough (Plain text) - Empty Field

Taxi Pick Up Location (Plain text) - Empty Field

Bridge Highway Name (Plain text) - Empty Field

Bridge Highway Direction (Plain text) - Empty Field

Road Ramp (Plain text) - Empty Field

Bridge Highway Segment (Plain text) - Empty Field

Garage Lot Name (Plain text) - Empty Field

Ferry Direction (Plain text) - Empty Field

Ferry Terminal Name (Plain text) - Empty Field

Latitude (Number) - Latitude of the location

Longitude (Number) - Longitude of the location

Location (Location) - Coordinates (Latitude, Longitude)

Good Luck!!!