111 Commits

Author SHA1 Message Date
Ivan Skytte Jørgensen
835bef157f Fix compilation of filter_titledb 2018-08-31 12:11:16 +02:00
Ivan Skytte Jørgensen
16f3fd001d Fix compilation of validate_rdbindex 2018-08-31 12:11:16 +02:00
Ivan Skytte Jørgensen
7a3eace43f fix compilation of print_urlinfo 2018-08-31 12:11:16 +02:00
Ivan Skytte Jørgensen
95005b0877 Fix compilation of dump_wordcount 2018-08-31 12:11:16 +02:00
Ivan Skytte Jørgensen
a79135e861 Fix compilation of clean_url.cpp and dump_unwanted.cpp 2018-08-31 12:11:16 +02:00
Brian Rasmusson
e383fdc5d9 added missing tokenizer include dir so we can build tools again.. 2018-08-11 12:36:42 +02:00
Ai Lin Chia
5463567d68 Add FX-Index-Code & HTTP-Status to FXARC 2018-05-08 11:52:01 +02:00
Ai Lin Chia
0318fed04d Only dump redirected titlerecs and add field to store redirected url 2018-05-08 11:23:27 +02:00
Ai Lin Chia
fc2317bc89 Fix dump name and memory leak 2018-04-26 15:00:09 +02:00
Ai Lin Chia
4616b66046 Add initial implementation of dumping titledb to customized archive file 2018-04-26 14:40:26 +02:00
Ai Lin Chia
1269bfdbea Add print version to tools 2018-04-24 15:39:40 +02:00
Ai Lin Chia
abb30bc51e Initialize domains for tools 2018-04-24 14:42:58 +02:00
Ivan Skytte Jørgensen
8a40f04f3e Fixed print_urlinfo.cpp 2018-04-24 14:17:20 +02:00
Ai Lin Chia
9674e723e4 Change list size 2018-04-20 16:23:30 +02:00
Ai Lin Chia
3872fe9ec9 Add code to verify shardnum 2018-04-20 16:14:30 +02:00
Ai Lin Chia
a437ba4f96 Check for out of range docId based on url 2018-04-20 12:08:38 +02:00
Ai Lin Chia
20d00433e7 Remove check for url with wrongly encoded sequence 2018-04-20 11:45:59 +02:00
Ai Lin Chia
5566483a8a Try to find url which may be encoded wrongly 2018-04-16 17:01:00 +02:00
Ai Lin Chia
2307a59dc2 More checks for detecting wrong ascii encoding 2018-04-16 15:14:38 +02:00
Ai Lin Chia
8750337040 Fix dump_wrong_encoding with ASCII charset 2018-04-16 13:58:00 +02:00
Ai Lin Chia
b670b56a23 Check for ascii as well (to detect probable wrong encoding) 2018-04-16 12:51:06 +02:00
Ai Lin Chia
8cfe1aa813 Don't check for more wrong encodings once we have established the fact 2018-04-16 11:39:20 +02:00
Ai Lin Chia
ce92434b62 Add more encodings to dump_wrong_encoding 2018-04-16 11:34:04 +02:00
Ai Lin Chia
4c9ce2a1d2 Add dump_charset to tools 2018-04-13 15:35:13 +02:00
Ai Lin Chia
606a89d852 Fix compilation error 2018-04-12 11:45:25 +02:00
Ai Lin Chia
cb4cc51992 Add utf8 decoded as csWindows1257 2018-04-12 11:44:01 +02:00
Ai Lin Chia
774d7d06d9 Add german to dump wrong encoding 2018-04-11 15:32:21 +02:00
Ai Lin Chia
6eeeebe9b6 Add swedish to dump wrong encoding 2018-04-11 15:24:24 +02:00
Ai Lin Chia
05721bcb9f Fix dump_wrong_encoding tool 2018-04-11 15:14:21 +02:00
Ai Lin Chia
e2e99f7be4 Add new tool to dump wrongly detected encoding 2018-04-11 14:51:40 +02:00
Ai Lin Chia
4c73bd6e64 Fix compilation error in tools 2018-04-11 13:13:25 +02:00
Ai Lin Chia
6fcaa13fcd Fix compilation error 2018-03-05 17:47:34 +01:00
Ai Lin Chia
956fb1b002 Fix sitehash search 2018-01-18 13:11:43 +01:00
Ai Lin Chia
461f14dc5a Fix bug where host is not treated as hex 2018-01-18 11:42:48 +01:00
Ai Lin Chia
fabc0d5b72 More logs 2018-01-18 11:37:06 +01:00
Ai Lin Chia
ea65f2e1a9 Enable logs 2018-01-18 10:58:51 +01:00
Ai Lin Chia
80cb8cb956 Fix logging for verify_spiderdb 2018-01-16 16:38:03 +01:00
Ai Lin Chia
fb45f5e6ad Add verify_spiderdb tool 2018-01-16 16:02:21 +01:00
Ai Lin Chia
0b7799524a Modify dump_redirect to dump first_ip as well 2018-01-08 13:47:15 +01:00
Brian Rasmusson
30c26c2d5b add word_variations to include dir 2017-12-22 15:37:27 +01:00
Brian Rasmusson
a90a6c536e output hash values for links in get_titlerec tool 2017-12-16 11:57:36 +01:00
Ai Lin Chia
4f1ab01497 Add dump tool to dump redirected document 2017-12-13 23:02:08 +01:00
Ai Lin Chia
f4bcf46e85 More dump tools 2017-12-06 12:11:21 +01:00
Ai Lin Chia
1df27bd90a Split noindex to check nofollow as well 2017-11-24 12:35:53 +01:00
Ai Lin Chia
cf22910ce5 Fix compilation error 2017-11-23 17:04:54 +01:00
Ai Lin Chia
d7c9104ce2 More criteria for dump_unwanted tool 2017-11-23 17:02:30 +01:00
Ai Lin Chia
92ca1b77d3 Add tools/dump_unwanted 2017-11-10 15:04:12 +01:00
Ai Lin Chia
c8d166f2f8 Check for javascript tags 2017-11-07 11:56:05 +01:00
Ai Lin Chia
326f25cbb4 Don't dump redirection pages/non txt/html pages 2017-11-07 11:46:24 +01:00
Ai Lin Chia
9be65ab54e Fix g_hostdb.init call 2017-11-03 10:34:12 +01:00