Abstract
To promote the protection of personal information in data opening, this paper analyzes in depth the current state of personal information disclosure in open government data. First, datasets are obtained from the relevant platforms and preprocessed, and those containing personal information are selected based on features such as field names and table names. Next, sensitive-information identification methods are applied to identify the various types of personal information in the data and map them to individuals, so that the number of individuals can be counted and their associated data detected. Finally, data visualization is used to present the current state of personal information disclosure intuitively. Although some public data open platforms have applied measures such as data classification and grading and de-identification, the published data still contain a large amount of directly exposed personal information; improvements are needed in standardized data classification and grading, sensitive information identification, and sensitive information desensitization.
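The pipeline summarized in the abstract (select fields by name, identify sensitive values, map them to individuals, count) can be illustrated with a minimal sketch. This is not the authors' implementation: the keyword list `PII_FIELD_HINTS`, the two regular expressions, and the function names are assumptions chosen for illustration, and only resident identity card numbers and mainland mobile numbers are covered. The checksum follows the standard 18-digit resident identity number rule.

```python
import re
from collections import defaultdict

# Hypothetical keyword list for spotting personal-information columns by name.
PII_FIELD_HINTS = ("姓名", "身份证", "手机", "电话", "name", "id_card", "phone")

# Assumed patterns: 18-character resident ID and 11-digit mobile number,
# each required not to sit inside a longer run of digits.
ID_RE = re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)")
PHONE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")

_WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
_CHECK = "10X98765432"


def is_valid_id(number):
    """Verify the check character of an 18-character resident ID number."""
    total = sum(int(d) * w for d, w in zip(number[:17], _WEIGHTS))
    return _CHECK[total % 11] == number[17].upper()


def candidate_fields(header):
    """Return the field names that hint at personal information."""
    return [f for f in header if any(h in f.lower() for h in PII_FIELD_HINTS)]


def scan_rows(rows):
    """Map each valid ID number to the phone numbers found in the same row."""
    individuals = defaultdict(set)
    for row in rows:
        text = " ".join(str(v) for v in row.values())
        ids = [m for m in ID_RE.findall(text) if is_valid_id(m)]
        phones = PHONE_RE.findall(text)
        for id_number in ids:
            individuals[id_number.upper()].update(phones)
    return dict(individuals)


# Made-up sample rows; the ID below is a widely used test value.
rows = [
    {"姓名": "张三", "身份证号": "11010519491231002X", "联系电话": "13800138000"},
    {"姓名": "张三", "备注": "身份证11010519491231002x 电话13900139000"},
    {"编号": "123456789012345678"},  # 18 digits, but the checksum fails
]
individuals = scan_rows(rows)
```

Keying on the validated ID number lets records from different rows or tables collapse onto one individual, which is what makes both the head count and the associated-data check possible. A fuller version would also need name recognition and other identifier types, which this sketch leaves out.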
Haisu CHEN, Jiachun LIAO*, Sicheng YAO
Research Center of Big Data Technology, Nanhu Laboratory, Jiaxing 314002, Zhejiang, China
* Corresponding author: Jiachun LIAO. E-mail: jliao@nanhulab.ac.cn
About the author: Haisu CHEN (1999-), male, master's degree; research interests: information processing, smart cities, and personal information protection. E-mail: hschen@nanhulab.ac.cn
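CHEN Haisu, LIAO Jiachun, YAO Sicheng. Identification and statistical methods for personal information disclosure in open government data[J]. Journal of Shandong University (Natural Science), 2024, 59(03): 95-106. DOI: 10.6040/j.issn.1671-9352.7.2023.2681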
Funding
Nanhu Laboratory small and micro research project grant (NSS2023C2002)