An open source project for turning unstructured HK address strings into address components and coordinates, with validation.
What is the origin of this project?
As journalists, we sometimes encounter batches of unstructured addresses that we have to structure, either to extract address components (e.g. district, road) for further analysis or to geocode them for visualization. However, the whole cleaning process is messy.
Currently we can look up addresses with the following tools: the OGCIO Address Lookup Service (ALS) and the Google Geocoding API (GAPI). However, there are some drawbacks:
What social problem are you trying to solve?
How did we begin from scratch?
Using the libpostal library, we will normalize street addresses with statistical NLP (or fall back to simple fuzzy matching in other cases). As data sources, the 'correct addressing' data from the Hong Kong Post and the street names from data.gov.hk will be used to validate the address components in each request.
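The fuzzy-matching fallback mentioned above could look roughly like the sketch below. It uses Python's standard difflib rather than libpostal, and the street list, the function name match_street and the 0.8 cutoff are all illustrative assumptions; the real list would be loaded from the data.gov.hk street names.

```python
from difflib import get_close_matches

# Hypothetical sample; the real list would come from data.gov.hk.
KNOWN_STREETS = [
    "NATHAN ROAD",
    "QUEEN'S ROAD CENTRAL",
    "HENNESSY ROAD",
    "DES VOEUX ROAD CENTRAL",
]


def match_street(raw, known=KNOWN_STREETS, cutoff=0.8):
    """Return the closest known street name, or None if nothing is close enough."""
    # Upper-case and collapse repeated whitespace before comparing.
    candidate = " ".join(raw.upper().split())
    matches = get_close_matches(candidate, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

A misspelled input such as "nathen  road" would resolve to "NATHAN ROAD", while a string that resembles no known street returns None and can be flagged for manual review.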
Our parser should first identify and validate the address components (checking that the essential components are filled in), and then pass the address to ALS for geocoding and to get the 'official' suggested address.
Finally, we will build a web front end, similar to the usaddress parserator, that supports bulk parsing and geocoding.
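The validate-then-geocode flow can be sketched as follows. Which components count as essential is not specified here, so the ESSENTIAL tuple is an assumption, as are the function names; the lookup argument stands in for whatever wrapper ends up calling ALS.

```python
# Assumption: the set of "essential" components is illustrative only.
ESSENTIAL = ("street", "district")


def missing_components(components, essential=ESSENTIAL):
    """Return the essential keys that are absent or blank in a parsed address."""
    return [k for k in essential if not str(components.get(k) or "").strip()]


def geocode_if_valid(components, lookup):
    """Call lookup (e.g. a wrapper around ALS) only when nothing essential is missing."""
    missing = missing_components(components)
    if missing:
        # Skip the lookup and report what the caller needs to fill in.
        return {"ok": False, "missing": missing}
    return {"ok": True, "result": lookup(components)}
```

Passing the lookup in as a function keeps the validation step testable without network access and leaves room to swap ALS for another geocoder.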
What challenges do we face?
What to do:
What resources do you need?
libpostal - a multilingual street address parsing/normalization library
Address Lookup Service - a web service providing lookup function on Hong Kong address records in both Chinese and English aggregated from various Government Bureaux/Departments
Correct addressing - scraped from Hong Kong Post
Street names - obtained from data.gov.hk
Optional: Database of private buildings in HK (Home Affairs Department), Names of buildings (Rating and Valuation Department)
Who are we?
Brian Leung, Reporter
brianleung1017@gmail.com
Relevant links: