Web Scraping Insurance Products

Feb 28, 2018

Link to complete code on Github

Introduction

This program was written because I have to do insurance market investigation weekly at work. Information such as products discontinued, new products being launched, coverage of the newly launched policies, and so on need to be recorded and later be generated as a report. In Taiwan, it is regulated by law that the clause, loading of certain ages, and exception clauses should be disclosed publicly on the internet. While large insurance companies in Taiwan often provide 100 to 200 insurance contracts, it is not efficient to go through webpages manually to find new products or discontinued ones. I thus wrote this program to automatically generate reports weekly to track newly launched/ discontinued products in the insurance market.

Algorithm

The program can mainly be divided into two parts as follows,

1. Extracting information from the web
The program extracts all the names of products available from the site of the targeted insurance company; it then exports the information into separated CSV files according to the category of insurance (ex. life, health, personal injury, annuities, group insurance, etc) with the inputted date added at the end of the name of the CSV files.


	def open(filename):						####read csv
		with open(os.path.join(myPath, filename), 'rU') as csvfile:
			readCSV=csv.reader(csvfile, delimiter=',')
			insurances=[]
			p_final=[]
			for insurance in readCSV:
				insurances.append(" ".join(insurance))
			for word in insurances:
				insurance=word.replace(" ", "")
				p_final.append(insurance.replace(u'\ufeff', ''))
			return p_final 
	def compare(new,previous):			####compare products in two csvs
		print("Newly Launched:\n",list(set(new)-set(previous)),"\n")
		print("Discontinued:\n",list(set(previous)-set(new)),"\n")
	def request(filename,page):			###write page content into csv
		csv_file = open(os.path.join(myPath, filename),'a', newline='',encoding='utf-8-sig')
		csv_writer = csv.writer(csv_file)
		csv_writer.writerow([" ".join(date.split())]) ###leave out blank spaces in the name
		res=requests.get(page)
		#res.raise_for_status()
		#res.encoding='CJK'
		soup=bs4.BeautifulSoup(res.text,'lxml')
		for i in soup.select('.panel-header'):
			insurance =i.text
			insurance=" ".join(insurance.split())
			#print(" ".join(insurance.split()))
			csv_writer.writerow([insurance.replace(u'\ufeff', '')])
		csv_file.close()
	def setdate():
		date=input("set the date in the file name: ")
		return date

2. Comparing CSV lists
The program can compare product lists of two designated dates(ex. all types of insurances of 20180716 vs those of 20180709 ) and produce outcomes of new products that had been launched (i.e newly posted on the site)and what had been discontinued (i.e no longer on the site) within the specified period.


renew=input("to compare all lists, please input: 1 \nto compare only two lists, please input: 2\nyour input: ")
if renew=="1":
	new_file=input("Enter the date of the new file: ")
	old_file=input("Enter the date of the previous file: ")
	for i in product_type:
		a=report.open(i+new_file+'.csv')
		b=report.open(i+old_file+'.csv')
		print("\n",i)
		report.compare(a,b)
		print("\n")


elif renew=="2":
	new_file=input("Enter the name of the new file: ")
	old_file=input("Enter the name of the previous file: ")
	a=report.open(new_file)
	b=report.open(old_file)
	report.compare(a,b)

else:
	print("wrong input")



Complete Code

The complete code for this program is as following.
######for Tracking Products of China Life Insurance Group  

import requests
import os
import bs4
import csv


class report:
	def __init__(self, new, previous, date):
		self.new=new
		self.previous=previous
		self.date=date
	def set_path():						#set the path of the directory
		global myPath
		myPath=input("Enter the path of the directory: ")
		return myPath
	def open(filename):						####read csv
		with open(os.path.join(myPath, filename), 'rU') as csvfile:
			readCSV=csv.reader(csvfile, delimiter=',')
			insurances=[]
			p_final=[]
			for insurance in readCSV:
				insurances.append(" ".join(insurance))
			for word in insurances:
				insurance=word.replace(" ", "")
				p_final.append(insurance.replace(u'\ufeff', ''))
			return p_final 
	def compare(new,previous):			####compare products in two csvs
		print("Newly Launched:\n",list(set(new)-set(previous)),"\n")
		print("Discontinued:\n",list(set(previous)-set(new)),"\n")
	def request(filename,page):			###write page content into csv
		csv_file = open(os.path.join(myPath, filename),'a', newline='',encoding='utf-8-sig')
		csv_writer = csv.writer(csv_file)
		csv_writer.writerow([" ".join(date.split())]) ###leave out blank spaces in the name
		res=requests.get(page)
		#res.raise_for_status()
		#res.encoding='CJK'
		soup=bs4.BeautifulSoup(res.text,'lxml')
		for i in soup.select('.panel-header'):
			insurance =i.text
			insurance=" ".join(insurance.split())
			#print(" ".join(insurance.split()))
			csv_writer.writerow([insurance.replace(u'\ufeff', '')])
		csv_file.close()
	def setdate():
		date=input("set the date in the file name: ")
		return date

print("\n===============================")
product_type=['Life_Insurance','Annuity_Insurance','Health_Insurance','Travel_Accident_Insurance']
report.set_path()
renew=input("Renew the product lists? y/n: ")

if renew=="n":
	print("process moves to list comparison")

elif renew=="y":
	#date=input("set the date in the file name: ")
	date=report.setdate()
	print("the progam is exporting product lists from the web")	
	report.request(product_type[0]+date+'.csv','https://www.chinalife.com.tw/wps/portal/chinalife/product-overview/personal/life-refund')	
	report.request(product_type[1]+date+'.csv','https://www.chinalife.com.tw/wps/portal/chinalife/product-overview/personal/annuity-assurance')
	report.request(product_type[2]+date+'.csv','https://www.chinalife.com.tw/wps/portal/chinalife/product-overview/personal/health')
	report.request(product_type[3]+date+'.csv','https://www.chinalife.com.tw/wps/portal/chinalife/product-overview/personal/travel')

	print("product lists are prepared\n")

else:
	print("wrong input")

print("===============================")
renew=input("to compare all lists, please input: 1 \nto compare only two lists, please input: 2\nyour input: ")
if renew=="1":
	new_file=input("Enter the date of the new file: ")
	old_file=input("Enter the date of the previous file: ")
	for i in product_type:
		a=report.open(i+new_file+'.csv')
		b=report.open(i+old_file+'.csv')
		print("\n",i)
		report.compare(a,b)
		print("\n")


elif renew=="2":
	new_file=input("Enter the name of the new file: ")
	old_file=input("Enter the name of the previous file: ")
	a=report.open(new_file)
	b=report.open(old_file)
	report.compare(a,b)

else:
	print("wrong input")


Program Instructions

1. Enter the path of the directory of your files:

demo1

2. If you want to obtain the updated product lists, enter y, otherwise n (see point 4)
After entering y, input the date, and the date will automatically be added to CSV files (CSV files be separated according to the category of insurances)


demo2

3. Enter the dates of the files to be compared, and the result would display the products newly launched/ discontinued or sales channel changes within the period. From the example below, there was no product newly launched or discontinued within 2018.10.13 and 2018.10.03.

demo3

4. (continued with point 2) Users can directly compare lists of two dates or simply two CSV files without renewing product lists.
To compare all lists (i.e all kinds of insurance) of two dates, enter 1
To compare only two CSV files, enter 2 (see point 6)


demo4

5. The result of the comparison would be displayed. From the example below, we can know that there were 1 product newly launched and 1 product discontinued within 2018.07.16 and 2018.07.09.

demo5

6. (continued with point 4) Enter the two file names to be compared, and the result of the comparison would be displayed

demo6



Back