Hong Kong Address Parser

An open-source project that turns unstructured Hong Kong address strings into address components and coordinates, with validation.

What is the origin of this project?

As journalists, we sometimes come across batches of unstructured addresses that we have to structure before we can extract address components (e.g. district or road) for further analysis, or geocode them for visualization. However, the whole cleaning process is a mess.

Currently we can look up addresses with two tools: the OGCIO Address Lookup Service (ALS) and the Google Geocoding API (GAPI). However, both have drawbacks:

  1. Garbage in, garbage out: ALS always returns a suggested address, even if you type just an 'A' (try it and see what happens!)
  2. Floor and unit are ignored: ALS doesn't support floor and unit, and including them in the address may corrupt the result
  3. Uncertain reliability: GAPI doesn't return results if the system finds multiple matches (try searching 將軍澳富麗花園 in Google Maps and see what happens!)
  4. No validation: when the result is wrong, we never know what we did wrong
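
The first two drawbacks suggest a small pre-check before any lookup. The following is only a sketch of that idea, not the project's implementation: the function names, the regular expressions, and the minimum-length threshold are all assumptions made for illustration, and the patterns cover only a few English-style floor and unit tokens.

```python
import re

# Hypothetical patterns for English-style unit and floor tokens (not exhaustive).
_UNIT = re.compile(r"\b(?:flat|unit|room|rm)\s+[A-Za-z0-9]+\b,?\s*", re.IGNORECASE)
_FLOOR = re.compile(r"\b(?:\d+|G)\s*/\s*F\b,?\s*", re.IGNORECASE)


def strip_floor_and_unit(address: str) -> str:
    """Remove floor and unit tokens so only the building-level address remains."""
    cleaned = _UNIT.sub("", address)
    cleaned = _FLOOR.sub("", cleaned)
    return cleaned.strip(" ,")


def is_plausible_query(address: str, min_chars: int = 3) -> bool:
    """Reject inputs too short to be an address (the threshold is an arbitrary assumption)."""
    # str.isalnum() is True for letters, digits, and CJK characters alike.
    return sum(ch.isalnum() for ch in address) >= min_chars


print(strip_floor_and_unit("Flat A, 12/F, Example Building, 1 Sample Road"))
print(is_plausible_query("A"))
```

A check like this would catch a single-letter query before it reaches the lookup service, and would keep floor and unit details from distorting the match.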

What social problem are you trying to solve?

How did we begin from scratch?

Using the libpostal library, we will normalize street addresses with statistical NLP (or fall back to fuzzy matching in other cases). As data sources, the 'correct addressing' data from the Hong Kong Post and the street names from data.gov.hk will be used to validate the address components of each request.
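
As a rough illustration of the fuzzy-matching fallback, the sketch below checks a street name against a reference list using Python's standard `difflib` module. It does not use libpostal. The street list, the function name `match_street`, and the 0.8 similarity cutoff are assumptions for illustration only; a real reference list would be loaded from the data.gov.hk street-name data.

```python
from difflib import get_close_matches

# Hypothetical reference list standing in for the data.gov.hk street names.
KNOWN_STREETS = [
    "NATHAN ROAD",
    "HENNESSY ROAD",
    "QUEEN'S ROAD CENTRAL",
    "DES VOEUX ROAD CENTRAL",
]


def match_street(raw, known=KNOWN_STREETS, cutoff=0.8):
    """Return the closest known street name, or None if nothing is similar enough."""
    # Normalize case and collapse repeated whitespace before comparing.
    candidate = " ".join(raw.upper().split())
    matches = get_close_matches(candidate, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(match_street("Nathon  Road"))   # a misspelled street name
print(match_street("Sesame Street"))  # a street not in the reference list
```

Returning `None` instead of the nearest guess is a deliberate choice here: it lets the caller report that a street name could not be validated, rather than silently accepting a poor match.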

Before geocoding an address, our parser should first identify and validate its components (checking that the essential components are filled in), and then pass the address to ALS for geocoding and to get the 'official' suggested address.
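
The validate-then-lookup order described above could be sketched as follows. The text does not say which components count as essential, so the rule used here (a district plus either a street or a building) is an assumption, as are the component keys and function names. The actual ALS request is left out; the sketch only produces the string that would be sent.

```python
def validate_components(components):
    """Return a list of problems; an empty list means the address looks complete."""
    def filled(key):
        return bool((components.get(key) or "").strip())

    problems = []
    # Assumed rule: a district is required, plus a street or a building name.
    if not filled("district"):
        problems.append("missing district")
    if not (filled("street") or filled("building")):
        problems.append("missing street or building")
    return problems


def prepare_for_lookup(components):
    """Validate first, then build the query string to pass to the lookup service."""
    problems = validate_components(components)
    if problems:
        # Surface what is wrong instead of sending a doomed query.
        raise ValueError("; ".join(problems))
    parts = [(components.get(k) or "").strip() for k in ("building", "street", "district")]
    return ", ".join(p for p in parts if p)


print(validate_components({"street": "1 Sample Road"}))
print(prepare_for_lookup({"street": "1 Sample Road", "district": "Sha Tin"}))
```

Raising an error with the list of problems addresses the 'no validation' drawback directly: the user learns which component is missing before any lookup is attempted.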

Finally, we will build a web front end, similar to the usaddress parserator, that supports bulk parsing and geocoding.

What challenges do we face?

What to do:

What resources do you need?

libpostal - a multilingual street address parsing/normalization library

Address Lookup Service - a web service providing lookup function on Hong Kong address records in both Chinese and English aggregated from various Government Bureaux/Departments

Correct addressing - scraped from Hong Kong Post

Street names - obtained from data.gov.hk

Optional: Database of private buildings in HK (Home Affairs Department), Names of buildings (Rating and Valuation Department)

Who are we? 

Brian Leung, Reporter

brianleung1017@gmail.com

Relevant links:

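【開放數據】政府「資料一線通」7成資料屬水份 一圖睇清有啲乜 ([Open Data] 70% of the data on the government's data.gov.hk portal is filler: one chart showing what is in there)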