HK Court Lists Archive

Last edited: 2018-11-29 · Created: 2018-06-22

 

A web application that automatically scrapes and archives Hong Kong court lists daily. The front-end application would offer the court list data as a searchable database, available to the public for free and without restrictions on use.

 

What is the origin of this Project?

Hong Kong has very limited open legal data. Currently, http://legalref.judiciary.gov.hk offers only a limited set of judgments and court documents, such as High Court judgments. Private, pay-walled services like D-Law exist, but their data is patchy and expensive (over $100 per document order).

 

All court cases are otherwise recorded in the court lists as soon as they enter the justice system, at the "Mention." Currently, the court lists are available for seven days only: three days before and three days after the current day. There is no publicly available archive.

As a matter of principle, justice cannot exist without transparency. Open legal data is crucial to a sound justice system.

 

What social problem are you trying to solve?

Journalists often learn of a case only after this short window has passed, for example from the GIS system, with limited information on the case. Once the window closes, it is impossible to search by the individual's or organization's name, case number, date, or nature of charge.

 

The web app would be useful not only to journalists hoping to pursue a case or research an individual's or organization's background, but also to due diligence professionals, legal professionals, and the public in general.

 

How do we begin from scratch?

 

What challenges do we face?

  • There may be challenges from the government or from organizations over violation of privacy (although the only private information would be the names)
  • There may be government restriction on the use of legal data
  • Long-term archive maintenance
  • Long-term server space, and possibly server maintenance
  • Funding ($)

 

What to do:

  • Seek legal advice on privacy issues
  • Build a scraper, possibly with the help of existing open-source tools, running at fixed daily intervals
  • Build a database to store the scraped data
  • Build a front-end web application with search entry points: parties' names, date, court, nature of charge, etc. (ref.: Pacer.gov), then offer the full list of available data
  • Long-term database maintenance
  • Fundraising may be needed to hire coders for longer-term development and for server space rental
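The "build a scraper" step above can be sketched as a page-parsing function. The HTML below is hypothetical — the real daily cause lists at judiciary.hk have their own (and varying) layouts, so the column names here are assumptions for illustration only.

```python
# Sketch: parse one day's cause list table into records, using only
# the standard library. Real pages will need per-court adjustments.
from html.parser import HTMLParser

class CauseListParser(HTMLParser):
    """Collects the text of <td> cells in each table row."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr" and self._cells:
            self.rows.append(self._cells)
            self._cells = []

    def handle_data(self, data):
        if self._in_td and data.strip():
            self._cells.append(data.strip())

def parse_cause_list(html):
    """Return a list of dicts keyed by the header row's column names."""
    parser = CauseListParser()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]

# Hypothetical sample — not the real judiciary.hk markup:
sample = """
<table>
  <tr><td>Case No.</td><td>Parties</td><td>Hearing</td></tr>
  <tr><td>HCMP 123/2018</td><td>A v. B</td><td>Mention</td></tr>
</table>
"""
records = parse_cause_list(sample)
print(records)
```

In production, this parser would run once per court list after each scheduled daily fetch, with the records passed on to the database step.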

 

 

What resources do you need?

Python → Scrapy

manual runs → fixed scraping frequency

on error → retry
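The "error → retry" note above can be sketched as a retry wrapper with exponential backoff. `fetch` is a stand-in for the real HTTP request (Scrapy also has its own built-in retry middleware); the flaky fetcher below is purely for demonstration.

```python
# Sketch: retry a failing fetch with exponential backoff.
import time

def with_retry(fetch, url, attempts=3, base_delay=1.0):
    """Call fetch(url), retrying on network failure with backoff."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except OSError:  # network-level failures
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# Hypothetical flaky fetcher: fails twice, then succeeds.
calls = {"n": 0}
def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("connection reset")
    return "<html>cause list</html>"

result = with_retry(flaky_fetch, "https://example.org", base_delay=0.01)
print(result)
```

Raising after the final attempt lets the daily job log the failure and try again at the next scheduled run, rather than silently dropping a day's list.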

server space estimation, data compression

SQL database for managing large datasets
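The "SQL database" note above can be sketched with SQLite (which morph.io already exports). The table and column names are assumptions; a `UNIQUE` constraint keeps repeated daily scrapes from duplicating rows.

```python
# Sketch: store scraped records and search by party name,
# as the front-end application would.
import sqlite3

conn = sqlite3.connect(":memory:")  # a file path in production
conn.execute("""
    CREATE TABLE IF NOT EXISTS hearings (
        case_no   TEXT,
        parties   TEXT,
        court     TEXT,
        hearing   TEXT,
        list_date TEXT,
        UNIQUE (case_no, list_date)
    )
""")

# Hypothetical records, as the scraper might produce them:
rows = [
    ("HCMP 123/2018", "A v. B", "High Court", "Mention", "2018-11-29"),
    ("DCCC 456/2018", "HKSAR v. C", "District Court", "Trial", "2018-11-29"),
]
conn.executemany(
    "INSERT OR IGNORE INTO hearings VALUES (?, ?, ?, ?, ?)", rows)

# Search by a party's name:
hits = conn.execute(
    "SELECT case_no FROM hearings WHERE parties LIKE ?", ("%HKSAR%",)
).fetchall()
print(hits)  # [('DCCC 456/2018',)]
```

At the projected ~50 MB per year, a single SQLite file would suffice for a long time; migrating to a server database (e.g. PostgreSQL) only becomes necessary if concurrent public search traffic demands it.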

 

 

Crawl from different levels, e.g.:

morph.io

10,000 characters ≈ 10 kB per court per day

10,000 characters × 20 courts × 5 days × 50 weeks

= 50,000,000 characters

≈ 50 MB per year
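The back-of-envelope estimate above, as code. It assumes 1 character ≈ 1 byte of stored text (before compression); the per-court size and court count are the document's own rough figures.

```python
# Storage estimate for one year of archived court lists.
chars_per_court_per_day = 10_000   # ~10 kB per court per day
courts = 20
sitting_days_per_week = 5
weeks_per_year = 50

chars_per_year = (chars_per_court_per_day * courts
                  * sitting_days_per_week * weeks_per_year)
print(chars_per_year)                    # 50,000,000 characters
print(chars_per_year / 1_000_000, "MB")  # ~50 MB per year
```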

Will computing speed hold up as data accumulates in the long term?

 

Progress

  1. demo using morph.io (by Omar K.)
  • https://morph.io/oktak/daily_caulist (source code: https://github.com/oktak/daily_caulist)
    • This is a very preliminary prototype. It currently handles only one of the 29 court hearing lists.
    • morph.io supports scraping the sites once per day, and provides CSV and SQLite downloads for further storage
    • TODO: extend scrape.py to handle all court hearing lists, and set up a permanent database hosting server
  2. wget (by Kennon Wong)
    • ...
  3. ...

 

 

 

 

Relevant links:

https://www.judiciary.hk/tc/crt_lists/daily_caulist.htm

https://www.d-law.com/