v0.45.0 2022-03-16T09:45:28+0900
- Remove in-article ads on ebc
v0.44.0 2021-11-01T09:04:01+0900
- Minor tweaks of the extractor of www.rti.org.tw
v0.43.0 2021-07-07T21:18:11+0900
- Update the extractor of www.eventsinfocus.org
v0.42.0 2021-06-24T09:23:46+0900
- Add a site-specific extractor for yimedia.com.tw
- Update the extractor of news.pts.org.tw to adapt to the website's updates
v0.41.0 2021-03-31T21:26:10+0900
- Update the extractor of ttv to catch up with the website updates.
- Update the extractor of www.mdnkids.com to catch up with the website updates.
v0.40.0 2021-02-09T09:55:29+0900
- Improve the UDN extractor so that paragraphs are split correctly
v0.39.0 2020-08-23T22:19:18+0900
- Improve the extraction of dateline of www.idn.com.tw
- Add a site-specific extractor for www.bbc.com
- Fix the format of extracted dateline of www.rti.org.tw
v0.38.0 2020-08-14T21:01:48+0800
- Improve the accuracy of extraction of journalist and dateline on news.pts.gov.tw
- Improve the recall of the extractor of news.tnn.tw
- Add a site-specific extractor for www.aljazeera.com
v0.37.0 2020-08-08T13:11:45+0800
- Add a site-specific extractor for www.penghutimes.com
- The dateline is now formatted differently: the time component no longer defaults to 23:59:59
v0.36.0 2020-08-07T08:56:18+0800
- Add a site-specific extractor for www.eventsinfocus.org
- Add a site-specific extractor for m.news.cctv.com
- Improve the extractor of newnet.tw
v0.35.0 2020-07-31T08:36:15+0800
- Reformat the dateline extracted from www.thinkingtaiwan.com
- Handle a few special cases on chinatimes and ebc.
- Improve the extraction of www.5ch.com.tw
- Improve the recall of dateline and journalist on www.mdnkids.com
v0.34.0 2020-07-26T22:57:38+0800
- Add a site-specific extractor for www.nownews.com
- Add a site-specific extractor for www.mdnkids.com
- Add a site-specific extractor for www.ustv.com.tw
v0.33.0 2020-07-22T06:44:18+0800
- Improve the extraction of dateline and journalist for opinion.udn.com
- Start parsing dateline strings and reformatting them as ISO 8601
v0.32.0 2020-07-18T07:23:04+0800
- Add a site-specific extractor for www.digitimes.com.tw
- Add a site-specific extractor for www.hkcna.hk
- Add a site-specific extractor for www.cw.com.tw
- Handle the English version of www.hkcnews.com
v0.31.0 2020-07-14T08:57:12+0800
- Add a site-specific extractor for newtalk.tw
- Add a site-specific extractor for talk.ltn.com.tw
v0.30.0 2020-07-12T17:10:32+0800
- Add a site-specific extractor for focustaiwan.tw
- Improve extraction of dateline and journalist for a few existing news sites.
v0.29.0 2020-07-11T18:33:51+0800
- Add a site-specific extractor for news.cctv.com
v0.28.0 2020-07-09T23:28:40+0800
- Add a site-specific extractor for www.xinhuanet.com
- Add a site-specific extractor for hk.on.cc
v0.27.0 2020-07-06T22:17:57+0800
- Add a site-specific extractor for new.ctv.com.tw
- Add a site-specific extractor for hk.crntt.com
v0.26.0 2020-07-03T18:13:05+0800
- Adjust the dateline output
v0.25.0 2020-07-02T21:40:30+0800
- Add a site-specific extractor for www.twreporter.org
- Convert extracted dateline to ISO 8601 format (pts)
v0.24.0 2020-06-29T08:46:10+0800
- Convert extracted dateline to ISO 8601 format (peopo, fountmedia)
- Improve the extraction of journalist names on SETN
- Handle an error when parsing dateline on www.idn.com.tw
v0.23.0 2020-05-15T23:10:02+0800
- Improve the extraction of journalist names on CNA, ETToday, turnnewsapp.com
v0.22.0 2020-05-11T23:40:22+0800
- Improve the extraction of journalist names
v0.21.0 2020-05-10T07:53:41+0800
- Improve the extraction of journalist names on news.tnn.tw
- Improve the extraction of journalist names on NTDTV
v0.20.0 2020-05-05T21:22:48+0800
- CTS: rewritten for quicker extraction.
v0.19.0 2020-05-04T14:01:50+0800
- UDN: Update the CSS ruleset for udn.com
- Add a site-specific extractor for www.idn.com.tw
- Niusnews: Extract journalist name
- SETN: Non-human journalist names are now extracted too.
- Properly handle non-article pages.
v0.18.0 2020-05-03T19:53:12+0800
- Add a site-specific extractor for www.ttv.com.tw
- Add a site-specific extractor for www.hkcnews.com
- Add a site-specific extractor for www.thestandnews.com
- Add a site-specific extractor for www.epochtimes.com
v0.17.0 2020-05-03T09:15:32+0800
- Improve the extraction of journalist names on cnews, EBC, CTEE and SETN
- Add a site-specific extractor for newnet.tw
v0.16.0 2020-04-26T21:36:41+0800
- Improve the extraction of journalist names on CTS, CTEE and rti.fr
v0.15.0 2020-04-23T09:17:50+0800
- Reduce the amount of warnings.
v0.14.0 2020-04-08T00:00:54+0800
- Improve the extraction of journalist names on ETToday and CNA
v0.13.0 2020-03-22T16:20:13+0800
- Improve the extraction of journalist names on www.setn.com
v0.12.0 2020-03-09T09:57:13+0900
- Improve the extraction of www.upmedia.mg
- Improve the extraction of journalist names on www.setn.com
v0.11.0 2020-02-15T09:54:00+0900
- Improve the accuracy of extracting www.taiwannews.com.tw
- Improve the extraction of news.cts.com.tw
- Add a site-specific extractor for estate.ltn.com.tw
v0.10.0 2020-02-04T09:01:00+0900
- Improve the extraction of turnnewsapp.com
- Improve coverage for various cases.
v0.9.0 2020-02-04T01:21:00+0900
- Improve the extraction of news.tnn.tw
- Improve the extraction of www.setn.com
v0.8.0 2020-02-03T09:52:00+0900
- Improve the extraction of www.rti.org.tw
- Improve the extraction of www.bcc.com.tw
v0.7.0 2020-02-02T10:28:00+0900
- Improve the extraction of www.taipeitimes.com
- Improve the extraction of udn.com
- Improve the extraction of money.udn.com
- Improve the extraction of stars.udn.com
- Improve the extraction of house.udn.com
v0.6.0 2020-02-01T19:51:00+0900
- Reject the extracted journalist name if it happens to be one of the known newspaper names.
- Improve the extraction of https://www.storm.mg
v0.5.0 2020-01-29T10:30:00+0900
- Improve the extraction of a few news sites.
0.4.0 2020-01-26T01:51:45+0900
- Improve the extraction of a few news sites.
0.3.0 2020-01-26T01:51:45+0900
- Improve the extraction of a few news sites.
0.2.0 2020-01-25T23:23:39+0900
- Improve the extraction of a specific news site.
0.1.1 2020-01-24T11:13:11+0900
- Fix a fat-finger mistake.
0.1.0 2020-01-24T09:16:11+0900
- Improve the extraction of a few news sites.
0.0.9
- Improve the recall of a specific news site.
0.0.8 2020-01-20T22:08:39+0900
- Improve the extraction of journalist names on a few more news sites.
0.0.7
- Remove a bunch of waste.
0.0.6
- Handle "utf-8" charset correctly.
0.0.5
- GenericExtractor can now extract directly from HTML file.
0.0.4
- Introduce JSONLD-based extractor.
0.0.3
- Introduce CSS-based site-specific extractors.
0.0.2
- Improvements on error-handling.
0.0.1
- Initial release