Code Changes

Progress Update


Data Analysis Techniques
Link

Quantitative data – anything measurable, usually numerical figures and quantities. Ex. sales figures, website visitors, and percentage increases in revenue.
Quantitative data analysis techniques – focus on statistical, mathematical, or numerical analysis. They use algorithms and computational techniques to explain patterns and make predictions.
Qualitative data – cannot be measured objectively and is more open to interpretation. Qualitative data can include comments on a survey, product reviews, and social media posts.
Qualitative data analysis techniques – make sense of unstructured data by organizing it into themes.
Covers the 7 most useful methods of data analysis:

  • Regression analysis – estimates how one or more independent variables might impact the dependent variable, in order to identify trends and patterns. This is especially useful for making predictions and forecasting future trends.
  • Monte Carlo simulation - a computerized technique used to generate models of possible outcomes and their probability distributions. It considers a range of possible outcomes and then calculates how likely it is that each particular outcome will be realized. Used by data analysts to conduct advanced risk analysis.
  • Factor analysis - a technique used to reduce a large number of variables to a smaller number of factors. It works on the basis that multiple separate, observable variables correlate with each other because they are all associated with an underlying construct; this is called covariance. This allows datasets to be compressed into smaller samples and helps uncover hidden patterns, like concepts that aren’t easily measured.
  • Cohort analysis – a technique that divides data that has common characteristics into related groups for analysis. Often used to target more specific customer segments and personalize their experience.
  • Cluster analysis – sorts different data points into groups that are similar to each other and not similar to data points in another cluster. Seeks to find structures within the dataset.
  • Time series analysis – analyzes a sequence of data points that measure the same variable at different points in time. Used to identify trends and cycles over time, and allows analysts to forecast how the variable may fluctuate in the future.
  • Sentiment analysis – the process of sorting and understanding textual data to interpret and classify the emotions it conveys. This gives you a sense of how satisfied your customers are with the brand or service.
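
To make the Monte Carlo entry above more concrete, here is a minimal sketch using only Python's standard library. The scenario (monthly visitors and a conversion rate drawn from made-up ranges) and every number in it are assumptions for illustration, not figures from the article.

```python
import random


def simulate_monthly_sales(trials=10_000, seed=42):
    """Estimate the spread of monthly sales by repeated random sampling."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same draws
    outcomes = []
    for _ in range(trials):
        visitors = rng.randint(8_000, 12_000)   # hypothetical visitor range
        conversion = rng.uniform(0.01, 0.03)    # hypothetical conversion rate
        outcomes.append(visitors * conversion)
    return outcomes


def probability_at_least(outcomes, threshold):
    """Share of simulated outcomes that reach the threshold."""
    return sum(1 for value in outcomes if value >= threshold) / len(outcomes)


if __name__ == "__main__":
    sales = simulate_monthly_sales()
    print(f"Estimated P(sales >= 250): {probability_at_least(sales, 250):.2f}")
```

Each trial is one possible outcome; the fraction of trials that clear a threshold is the estimated probability of that outcome.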

Hashing Algorithms in Python
Link

Covers Python's built-in hashing library, hashlib.

Types of hash algorithms included:

  • SHA1
  • SHA224
  • SHA256
  • SHA384
  • SHA512
  • MD5
Hashlib Properties
  • algorithms_guaranteed - a set containing the names of the hash algorithms that are guaranteed to be supported by the module on all platforms
  • algorithms_available - a set of algorithms that are available in the running Python interpreter.
Hash Properties - Constants
  • digest_size - the byte size of the hash result
  • block_size - the internal block size of the hash algorithm in bytes
Hash Properties - instance variables
  • name - the canonical name of the hash in lowercase. Can be used with new() as a parameter to make an instance of a hash of the corresponding type
Hash Functions
  • update(data) - updates the hash object with a bytes-like object (ex: bytes, bytearray, etc.)
  • digest() - returns the digest of the data passed to update() so far, as a bytes object. Each byte is in the range 0 to 255
  • hexdigest() - similar to digest() except it returns a string containing only hexadecimal digits
  • copy() - makes a copy of the object
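
The properties and functions above can be tried out with a short script. This is a minimal sketch using SHA256; the input strings are arbitrary examples.

```python
import hashlib

# Names of the algorithms guaranteed on every platform (a set of strings).
print(sorted(hashlib.algorithms_guaranteed))

# Feed data in pieces with update(); the result matches hashing it in one call.
hasher = hashlib.sha256()
hasher.update(b"hello, ")
hasher.update(b"world")
one_shot = hashlib.sha256(b"hello, world")

# name, digest_size and block_size describe the algorithm behind the object.
print(hasher.name, hasher.digest_size, hasher.block_size)

# digest() returns raw bytes; hexdigest() returns the same value as hex text.
print(hasher.digest())
print(hasher.hexdigest())

# copy() clones the current state, so the clone can continue independently.
branch = hasher.copy()
branch.update(b"!")

# new() builds a hash object from an algorithm name such as hasher.name.
same_kind = hashlib.new(hasher.name, b"hello, world")
```

Because copy() clones the internal state, it is handy when several inputs share a common prefix.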

PyCon - Finding a Home in Singapore Using a Data Driven Approach
Link

A woman used open data to find an ideal home in Singapore

99.co – property search portal, used to look up rental/Airbnb

Even after entering search parameters, still faced with thousands of listings

Problem: how do you choose best possible listing?

First, she looked at which MRT station is cheapest to live around, and made a map visualization.

She used a geolocation file of the MRT map around Singapore. The data comes in both .shp and .dbf file types, which are read with shapefile.Reader in Python to access the records.

Next, she checks how many shape objects are in the shapefile, which corresponds to the number of MRT stations in Singapore. Then she checks the type of shapes in the shapefile.

Next, she prints the fields of the object. Each field prints as [field name, the field type as an abbreviation, the max size of the field, number of decimal places (0 by default if not applicable)].

Next, she iterates through the first 3 records, printing the coordinates of each object, which is a train station in Singapore, in latitude/longitude. Singapore has its own coordinate system, so she converts the coordinates to it using an SVY package library in Python to get a more accurate location within the city.

Now that the coordinates are converted, she uses Basemap (another Python library for plotting maps) to plot the map coordinates, draw the outline of the country, and add markers to the map. Then she uses Folium, a Python package, to make the map. Next, she iterates through the list of coordinates and adds a marker for each record, with the name attribute as a pop-up for each coordinate.

Next, she computes the price per square foot around each station. To do this, she retrieves open source data about housing and development prices. The problem is that the data doesn’t contain coordinate information; to compensate for this, she uses an API to submit an address, which returns the coordinates. She then uses GeoPy to compute the distances between addresses, and filters out coordinates that are more than 1km from the stations.
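
The 1 km filter can be sketched without third-party packages. The talk used GeoPy; the haversine great-circle formula below is a stand-in for it (my assumption), and the coordinates in the usage note are made-up points, not real stations or listings.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_radius(station, points, radius_km=1.0):
    """Keep only the (lat, lon) points within radius_km of the station."""
    return [point for point in points if haversine_km(*station, *point) <= radius_km]
```

For example, `within_radius((1.300, 103.800), [(1.301, 103.801), (1.400, 103.900)])` keeps only the first point, since the second is several kilometres away.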
She then iterates through the stations and uses pre-computed transactions over the past 3 months, within 1 km of the stations, and normalizes them. If zero, the markers are color coded as white; otherwise, she uses the 0-1 scale value to determine a hue based on the price: blue for the lowest cost and red for the highest. Finally, she adds a price to the marker.
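
The normalize-then-colour step can be sketched as follows. The function names are hypothetical, and the simple linear blend from blue to red is my simplification of the hue mapping described in the talk.

```python
def normalize(values):
    """Scale values to the 0-1 range (min-max normalization)."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # avoid dividing by zero when all values match
    return [(value - low) / (high - low) for value in values]


def price_to_hex(scaled, has_data=True):
    """Map a 0-1 value to a colour: blue for the cheapest, red for the dearest."""
    if not has_data:
        return "#ffffff"  # white marker when a station has no transactions
    red = round(255 * scaled)
    blue = round(255 * (1 - scaled))
    return f"#{red:02x}00{blue:02x}"


if __name__ == "__main__":
    prices = [900, 1200, 2100]  # hypothetical prices per square foot
    for price, scaled in zip(prices, normalize(prices)):
        print(price, price_to_hex(scaled))
```

The resulting hex string could then be passed as the marker colour when each station is added to the map.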

The result is that prices in the central business district are higher than elsewhere, which is not surprising. But the rate at which prices change differs: about x2 closer to the border versus about x4 closer to the central business district. As a result, she was able to make a data-driven decision about where to live.

The rest of the video is her going over the source code on both GitHub and Jupyter. She also live edits the code to show the audience visual changes to the map, like changing the marker shapes and colors.


Downloading CSV from a URL in Python
Link
Basic steps of downloading a CSV file from a URL:
  1. Use a GET request with the CSV url
  2. Get the content from the response
  3. Open the file using wb mode
  4. Write the contents of the response to the file at the desired location
  5. Close the file
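
The steps above can be sketched with the standard library alone. This uses urllib.request rather than a third-party HTTP library (my assumption about tooling; the linked article may use something else), and the function name is hypothetical.

```python
import urllib.request


def download_csv(url, destination):
    """Fetch the CSV at url and save it to destination; returns bytes written."""
    # Send the GET request and read the response body as bytes.
    with urllib.request.urlopen(url) as response:
        content = response.read()
    # "wb" writes the bytes exactly as received, with no newline translation.
    with open(destination, "wb") as csv_file:
        written = csv_file.write(content)
    # Leaving the with block closes the file, so no explicit close() is needed.
    return written
```

A call would look like `download_csv("https://example.com/data.csv", "data.csv")`, where the URL is a placeholder.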

Multiple Linear Regression using Python
Link
Basic steps:
  1. Import the data
  2. Check for data that has null values (NaN). If any records appear with null values, replace them with a constant (a numerical value)
  3. Split the data into input values as x and output values as y, as numpy arrays
  4. Split x and y into random train and test subsets, using a floating point fraction that defines the size of the subset to use
  5. Create a linear regression model and fit it using the xTrain and yTrain values
  6. Use model.score() with the xTest and yTest values to calculate your accuracy (for a linear regression model this is the R² score)
  7. Use model.predict() with xTest to return an array of prediction values
  8. Iterate through the array of predicted values and subtract each one from the yTest value at the same index to get the margin of error
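
The workflow above can be sketched with NumPy and scikit-learn. The synthetic dataset (y = 3*x1 + 2*x2 + 5), the single injected NaN, and the 25% test split are all assumptions for illustration; the article works from its own imported data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the imported data: two inputs and one output.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(200, 2))
y = 3 * x[:, 0] + 2 * x[:, 1] + 5

# Simulate one missing value, then replace every NaN with a constant.
x[0, 1] = np.nan
x = np.nan_to_num(x, nan=0.0)

# Hold out a random 25% of the rows for testing.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

# Fit on the training rows first; scoring and predicting need a fitted model.
model = LinearRegression().fit(x_train, y_train)

# score() returns R^2 on the held-out rows.
score = model.score(x_test, y_test)

# predict() returns one prediction per test row.
predictions = model.predict(x_test)

# Per-row error: actual value minus predicted value.
errors = y_test - predictions

print(f"R^2 on the test split: {score:.3f}")
print("first three errors:", errors[:3])
```

NumPy subtracts the two arrays element by element, so the explicit loop over indexes described in the last step is not needed here.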

Uploading JSON Files

Link

To read the contents of a JSON file, use the statement import json to load Python's built-in library for JSON. Use a with statement and the function open(filename) to open the JSON file. Then use json.load(myFile) to parse the file directly, or read its contents into a string and pass that to json.loads(myString). Either way, the data is parsed into a dictionary (or a list, if the top-level JSON value is an array). From there, you can manipulate the data to create your own Python objects.
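
As a minimal sketch of the above (the function name and the file name in the usage note are hypothetical):

```python
import json


def read_json_file(path):
    """Open a JSON file and parse it into Python objects."""
    with open(path, encoding="utf-8") as json_file:
        # json.load() reads from the open file object; it is equivalent to
        # json.loads(json_file.read()), which parses a string instead.
        return json.load(json_file)
```

For a file such as `settings.json` containing `{"name": "demo"}`, `read_json_file("settings.json")["name"]` gives back the string `demo`.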

Virus Scanning Files


If your website allows users to upload files to a server, your server needs to be protected from malicious uploads. There's no easy way to implement virus scanning on your website short of purchasing your own virus scanning software. Fortunately, SharpAPIs provides a more cost-effective solution.

SharpAPIs is free to use and lets you send a file for a virus scan via a POST request; the response contains the result. However, the free plan only includes 100 scans, which refresh every 21 days. It can be upgraded to 1000 scans, or more as needed, for an added cost.

  1. Sign up for an account and confirm your account using the email they send. Be sure to check your Junk mail folder!
  2. Grab your X-ApplicationID and your X-SecretKey from your account and record them for later.

Python version

  1. Install the requests library. The json module ships with Python's standard library, so it does not need to be installed.

        pip3 install requests

  2. Next, I created a new script called virusScanFile.py.
  3. Create the url and the directory to fetch the file from. Then, in the headers of the request, pass the X-ApplicationID and X-SecretKey you recorded earlier as strings.
  4. Open the file using "rb" mode, and define data and files using the filename and the contents read from the file.
  5. Send a POST request using the 4 parameters (the url, headers, data, and files), and check the response for the result.

C# version

  1. Install the Newtonsoft.Json library from the NuGet Package Manager.
  2. Set up the function in your controller.
  3. Create the url and the directory to fetch the file from. Then, in the headers of the request, pass the X-ApplicationID and X-SecretKey you recorded earlier as strings.
  4. Open the file for reading as binary (the equivalent of Python's "rb" mode), and define data and files using the filename and the contents read from the file.
  5. Send a POST request using the 4 parameters (the url, headers, data, and files), and check the response for the result.

Implementation notes:

  • Files need to be uploaded to the server before they can be read and scanned, unless you are sending a file from a URL.
  • If the file size is over 50000KB, your request may return a 400-series error because there is too much content in the request. You may want to consider scanning the file in portions instead of all at once.
  • If you are reading directly from the file, the file content needs to be read in binary format before being inserted into the request content.
  • SharpAPIs also provides a way to get the status of a previous virus scan if you send the id of the scan in a GET request. This is particularly useful for auditing if you are storing all scans in a database and want to look up the status of a particular scan in the future.
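
The Python steps described earlier can be sketched roughly as follows. Only the X-ApplicationID and X-SecretKey header names come from the steps above. The endpoint URL, the form field names, and the JSON response format are placeholders and assumptions of mine, so check the service's documentation for the real values.

```python
import os

# Placeholder endpoint: NOT the real service URL.
SCAN_URL = "https://example.invalid/virus-scan"


def build_scan_request(file_path, app_id, secret_key, url=SCAN_URL):
    """Collect the four POST parameters: url, headers, data and files."""
    # "rb" reads the file as raw bytes, as the notes above require.
    with open(file_path, "rb") as handle:
        content = handle.read()
    file_name = os.path.basename(file_path)
    return {
        "url": url,
        "headers": {"X-ApplicationID": app_id, "X-SecretKey": secret_key},
        "data": {"filename": file_name},           # hypothetical form field
        "files": {"file": (file_name, content)},   # hypothetical field name
    }


def scan_file(file_path, app_id, secret_key):
    """Send the file for scanning and return the parsed response."""
    import requests  # third-party: pip3 install requests

    params = build_scan_request(file_path, app_id, secret_key)
    response = requests.post(timeout=30, **params)
    response.raise_for_status()  # raise on 4xx/5xx, e.g. an oversized upload
    return response.json()       # assumes the service replies with JSON
```

Keeping the request assembly separate from the network call makes it possible to inspect the headers and payload without spending one of the free scans.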
Source

Setting Data Expiry in MongoDB


Usually, when data is inserted into a database, it is considered persistent because in most cases the data is stored permanently in records. However, the drawback of this is storage and memory concerns, as neither is an infinite resource. For the Data2Int website, data that is uploaded by users doesn't persist for longer than 24 hours.

I tried a couple of different approaches to this as a developer. I first tried writing a scheduled task within the application that would call a function at a scheduled time.

            
                
            
        

However, the problem with this approach comes with trying to call this task within the application, as it needs to run on its own, rather than be triggered by a web page event. That was one of the main reasons this approach didn't work.

After further research and reaching out online for help from others, I learned of TTL Indexes in MongoDB. "TTL indexes are special single-field indexes that MongoDB can use to automatically remove documents after a certain amount of time or specific clock time..." [1]. A background thread in mongod runs every 60 seconds to remove expired documents, so data may persist past its expiry time until the next pass, and longer depending on the speed of the background process.

To specify the amount of time you want the data to last, use the create_index function and set the expireAfterSeconds option, passing the duration in seconds. If you're intending on inserting more than one new document into the collection before the index runs out, you'll need to ensure the index name is different from the last one being inserted. To do this, I used Python's random module and appended a randomly generated integer to the end of the index name before insertion.
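
A minimal sketch of setting the expiry with PyMongo follows. The helper names, the createdAt field, and the collection in the usage note are hypothetical. In this sketch the index is created once on a date field, and each document then expires relative to its own timestamp.

```python
import datetime

ONE_DAY_SECONDS = 24 * 60 * 60  # 86400


def ensure_ttl_index(collection, field="createdAt", seconds=ONE_DAY_SECONDS):
    """Create a TTL index so documents expire `seconds` after the date in `field`.

    `collection` is expected to be a pymongo Collection (or anything exposing
    the same create_index method).
    """
    return collection.create_index(field, expireAfterSeconds=seconds)


def new_upload_document(payload):
    """Stamp a document with a UTC creation time for the TTL index to use."""
    return {
        "payload": payload,
        "createdAt": datetime.datetime.now(datetime.timezone.utc),
    }
```

With PyMongo installed, usage might look like `ensure_ttl_index(MongoClient()["mydb"]["uploads"])` followed by `insert_one(new_upload_document(data))`, where the database and collection names are placeholders. The indexed field has to hold a date value for the expiry to apply.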

            
                
            
        

You can pass a hardcoded value, like 86400 (the number of seconds in a day). If you need the data to expire around a specified time instead, this is possible, but you would need to calculate the difference in time yourself. You can do this with the datetime module in Python, using a function that returns a timedelta, which is exactly the change in time. The following example shows you how to calculate the difference in time with midnight in mind.
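
One way to compute the seconds remaining until the next midnight, as a minimal sketch using only the datetime module (the function name is hypothetical, and UTC is used throughout):

```python
from datetime import datetime, timedelta, timezone


def seconds_until_midnight_utc(now=None):
    """Whole seconds from `now` until the next midnight, in UTC."""
    now = now or datetime.now(timezone.utc)
    # Jump to tomorrow, then reset the clock fields to 00:00:00.
    next_midnight = (now + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    remaining = next_midnight - now  # subtracting datetimes gives a timedelta
    return int(remaining.total_seconds())
```

The returned integer can then be passed wherever a number of seconds is expected, in place of a hardcoded 86400.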

            
                
            
        

This is just one of a few ways to calculate the time remaining, but it should be noted that using the UTC timezone value, rather than the local datetime, is considered best practice.

Source:[1] TTL Indexes

Personal Project: Covid-19 in Ontario - An Analysis


Preface

The purpose of this project is to analyze Covid-19 trends within the past year and a half, with a focus on trends in Ontario and a small comparison with other Canadian provinces. Topics of interest that I will cover include ICU admissions due to Covid-19 in Ontario, vaccination rates across Ontario, and case count comparisons between Ontario and other provinces. These visualizations will be used to uncover persisting trends and give an overall analysis of how Ontario has fared during the pandemic up to this point.

Visualizing the Data

Data Sources:
Figure 1 and 2: Status of COVID-19 cases in Ontario
Figure 3: Areas in Canada with cases of COVID-19 as of November 29, 2021
Figure 4: Ontario COVID-19 Vaccine data


Analyzing the Data

Figure 1 is a line graph displaying how many patients were being admitted to the ICU due to Covid-19 across the province. The graph shows the fluctuation in admissions over time since the pandemic began. It’s important to note that this metric only started being tracked in May 2020, which explains the sharp increase in admissions from 0 in March and April 2020 to a spike of ~250 in May 2020. Through most of 2020, Ontario was able to keep ICU admissions under control, until about December 2020, when cases began to sharply increase and continued to do so into the beginning of 2021. This spike can be attributed to gatherings over the Christmas and New Year's holidays, as well as low uptake of the vaccine due to the limited supplies available at the time. This spike in ICU counts correlates with the stats in Figure 2, which also shows a sharp increase in Covid-19 cases beginning in December 2020; by January 2, 2021, Ontario reported 3,363 new COVID-19 cases, "the most in one day since the pandemic began." [1]

In early 2021, the province had an even harder time keeping case counts and ICU admissions under control. The province had enforced its 2nd strict province-wide lockdown by December 26, 2020, which had been escalated to a stay-at-home order by January 14, 2021.[1] By mid-February, the province had made a rash decision to lift the stay-at-home order in most PHUs and reopen the province, despite many warnings from top health officials. However, vaccine supply was still limited to mainly seniors and LTC residents. As well, the Alpha variant was beginning to spread rapidly in areas that were open. "The combination of that optimism from a successful lockdown leading to governments wanting to reopen and the background of these variants of concern emerging, plus, delays in the vaccine arrival is setting up really this perfect storm for a massive third wave," said Dr. Brooks Fallis, a critical care physician in Toronto.[2] This forced the province to enact its third province-wide lockdown and stay-at-home order, which was in effect from April 3, 2021 to mid-June 2021. Palin writes, with reference to Wave 2, "Restrictions were lifted too early, and some provincial governments failed to bring in policies like paid sick leave that would suppress workplace outbreaks."[3] By late April, ICU admissions spiked to ~900 patients. Figure 2 shows a sharp increase from about 300,000 total cases at the start of 2021 to over 500,000 cases by April 2021.

Thankfully, with a combination of stringent public health measures and increased vaccine availability, cases finally began plummeting around May 2021. By July 2021, with "over 77 per cent of the population in Ontario ages 12 and over have received one dose of a COVID-19 vaccine and over 50 per cent have received a second dose"[4] and ICU admissions steadily declining, the province entered Stage 3 of Reopening. Figure 4 shows the rate of vaccination in Ontario with both first and second doses. We largely see the curve stay flat until about late April, which is when key demographics, like those 40 and older, became eligible for the increased supply of vaccines available. This also correlates with how sharply ICU admissions due to Covid-19 start to plummet in early May, roughly two weeks later, as seen in Figure 1, because that's how long it takes for a person to gain immunity from a first dose of either mRNA vaccine, "which takes 2 weeks after vaccination for the body to build protection (immunity) against the virus." [5]

In Figure 2, since May 2021, we’ve seen a relative flattening of new Covid-19 case counts in Ontario, with an increase of only about 100,000 cases over the past 6 months, compared to an increase of about 200,000 reported cases between April and June. This can largely be attributed to the rapid vaccine uptake in the province, as well as the vaccines being more widely available since then. While the first year of the pandemic (March 2020 to April 2021) shows case counts and ICU admissions due to Covid-19 as closely linked, adding vaccines to the equation seems to have largely decoupled this relationship so far. Ontario saw a small increase in cases from August to October, due to the circulation of the more transmissible Delta variant, but the province has since remained open, with ICU admissions significantly lower than the spike in Wave 3 during Spring 2021.

Figure 3 shows a comparison of total Covid-19 cases to date across all provinces, as well as the Canadian population as a whole. Looking objectively, Ontario is the clear leader to this point for Covid-19 cases; there are several potential reasons for this:

  1. Ontario has the highest population of any province or territory in Canada at 14,759,431 as of the beginning of 2021, ahead of Quebec with a population of 8,579,370 people. This results in more cases in general as well as a lengthier vaccine rollout to the general population. [6]
  2. For most of the pandemic, the Conservative government offered little support to ailing businesses and individuals struggling financially, such as providing no paid sick leave for those needing to isolate in the middle of the third provincial lockdown. After a year of the pandemic and calls from both opposition parties and top medical officials, the Ford government re-instituted 3 paid sick days, which "pays up to $200 per day for workers taking time off due to being sick, displaying Covid-19 symptoms, need vaccination, or experiencing mental health issues." [7]
  3. The Ontario vaccine rollout program prioritized individuals by age for eligibility for the vaccine at a staggered pace. The reasoning for not moving to essential workers immediately was "that most COVID-19 hospitalizations and deaths involve people over 50". [8] However, with the introduction of more transmissible variants like Alpha and Delta, the program overlooked populations that were at higher risk of contracting the virus, like public-facing workers. This strategy drove community spread more than it should’ve.

However, Ontario and Quebec follow largely similar trajectories, being the most populated provinces as well as geographic neighbors. While Quebec does have a slightly lower total count, it’s clear that the provinces with the highest populations were hit the hardest by the pandemic in general.

Conclusion

While Ontario did have a rocky start to 2021 managing the virus, the province has rebounded from the third wave, staying open and limiting new public health restrictions, other than the implementation of proof of vaccination. However, with much of the population either past or approaching the 6 month mark since their 2nd dose of Covid-19 vaccine, it will be interesting to see if waning immunity will contribute to another large spike in ICU admissions. As well, with the WHO recently labelling a new variant of concern, Omicron, on November 26, 2021,[9] it is very difficult to predict the future of this pandemic and how long it will be until the virus can be considered endemic. Overall, Ontario has been the hardest hit province by the pandemic. While there have been questionable decisions made which did heavily impact this trajectory, like prematurely lifting the 2nd stay-at-home order before vaccine uptake was well underway in the population, the leaders have done a good job of shifting strategies and keeping both the ICU admissions and case counts to a manageable level up to this point.

Sources:
[1] Global News A timeline of COVID-19 in Ontario
[2] Global News ‘Perfect storm’: Is Canada headed for a third wave of COVID-19?
[3] CBC News Canada Why warnings of a deadly 3rd wave in Canada may have gone unheeded
[4] Ontario Newsroom Ontario Moving to Step Three of Roadmap to Reopen on July 16
[5] Centers for Disease Control and Prevention Key Things to Know About COVID-19 Vaccines
[6] Statistics Canada Table 17-10-0009-01 Population estimates, quarterly
[7] CBC News Toronto Ontario details plan for 3 paid sick days after a year of mounting pressure
[8] CBC News Ottawa The Ottawa area's weekly COVID-19 vaccine checkup: April 22
[9] World Economic Forum Everything we know so far about the Omicron COVID-19 variant

How I Did It:
  1. After researching and finding appropriate datasets, I extracted the columns of data I needed into separate .csv files, since you can't read from the same file at separate points.
  2. Below is the D3.js script I used to generate the four graphs:
            
                
                    
                
  3. I then added this HTML to my programmer page, which D3 renders the graphs into through my script. Note that you can use script tags to contain the JavaScript directly in the HTML. I put it in a separate .js file to make debugging easier and as a personal preference.