Ivan Skytte Jørgensen
|
835bef157f
|
Fix compilation of filter_titledb
|
2018-08-31 12:11:16 +02:00 |
|
Ivan Skytte Jørgensen
|
16f3fd001d
|
Fix compilation of validate_rdbindex
|
2018-08-31 12:11:16 +02:00 |
|
Ivan Skytte Jørgensen
|
7a3eace43f
|
fix compilation of print_urlinfo
|
2018-08-31 12:11:16 +02:00 |
|
Ivan Skytte Jørgensen
|
95005b0877
|
Fix compilation of dump_wordcount
|
2018-08-31 12:11:16 +02:00 |
|
Ivan Skytte Jørgensen
|
a79135e861
|
Fix compilation of clean_url.cpp and dump_unwanted.cpp
|
2018-08-31 12:11:16 +02:00 |
|
Brian Rasmusson
|
e383fdc5d9
|
added missing tokenizer include dir so we can build tools again
|
2018-08-11 12:36:42 +02:00 |
|
Ai Lin Chia
|
5463567d68
|
Add FX-Index-Code & HTTP-Status to FXARC
|
2018-05-08 11:52:01 +02:00 |
|
Ai Lin Chia
|
0318fed04d
|
Only dump redirected titlerecs and add field to store redirected url
|
2018-05-08 11:23:27 +02:00 |
|
Ai Lin Chia
|
fc2317bc89
|
Fix dump name and memory leak
|
2018-04-26 15:00:09 +02:00 |
|
Ai Lin Chia
|
4616b66046
|
Add initial implementation of dumping titledb to customized archive file
|
2018-04-26 14:40:26 +02:00 |
|
Ai Lin Chia
|
1269bfdbea
|
Add print version to tools
|
2018-04-24 15:39:40 +02:00 |
|
Ai Lin Chia
|
abb30bc51e
|
Initialize domains for tools
|
2018-04-24 14:42:58 +02:00 |
|
Ivan Skytte Jørgensen
|
8a40f04f3e
|
Fixed print_urlinfo.cpp
|
2018-04-24 14:17:20 +02:00 |
|
Ai Lin Chia
|
9674e723e4
|
Change list size
|
2018-04-20 16:23:30 +02:00 |
|
Ai Lin Chia
|
3872fe9ec9
|
Add code to verify shardnum
|
2018-04-20 16:14:30 +02:00 |
|
Ai Lin Chia
|
a437ba4f96
|
Check for out of range docId based on url
|
2018-04-20 12:08:38 +02:00 |
|
Ai Lin Chia
|
20d00433e7
|
Remove check for url with wrongly encoded sequence
|
2018-04-20 11:45:59 +02:00 |
|
Ai Lin Chia
|
5566483a8a
|
Try to find url which may be encoded wrongly
|
2018-04-16 17:01:00 +02:00 |
|
Ai Lin Chia
|
2307a59dc2
|
More checks for detecting wrong ascii encoding
|
2018-04-16 15:14:38 +02:00 |
|
Ai Lin Chia
|
8750337040
|
Fix dump_wrong_encoding with ASCII charset
|
2018-04-16 13:58:00 +02:00 |
|
Ai Lin Chia
|
b670b56a23
|
Check for ascii as well (to detect probable wrong encoding)
|
2018-04-16 12:51:06 +02:00 |
|
Ai Lin Chia
|
8cfe1aa813
|
Don't check for more wrong encodings once we have established the fact
|
2018-04-16 11:39:20 +02:00 |
|
Ai Lin Chia
|
ce92434b62
|
Add more encodings to dump_wrong_encoding
|
2018-04-16 11:34:04 +02:00 |
|
Ai Lin Chia
|
4c9ce2a1d2
|
Add dump_charset to tools
|
2018-04-13 15:35:13 +02:00 |
|
Ai Lin Chia
|
606a89d852
|
Fix compilation error
|
2018-04-12 11:45:25 +02:00 |
|
Ai Lin Chia
|
cb4cc51992
|
Add utf8 decoded as csWindows1257
|
2018-04-12 11:44:01 +02:00 |
|
Ai Lin Chia
|
774d7d06d9
|
Add german to dump wrong encoding
|
2018-04-11 15:32:21 +02:00 |
|
Ai Lin Chia
|
6eeeebe9b6
|
Add swedish to dump wrong encoding
|
2018-04-11 15:24:24 +02:00 |
|
Ai Lin Chia
|
05721bcb9f
|
Fix dump_wrong_encoding tool
|
2018-04-11 15:14:21 +02:00 |
|
Ai Lin Chia
|
e2e99f7be4
|
Add new tool to dump wrongly detected encoding
|
2018-04-11 14:51:40 +02:00 |
|
Ai Lin Chia
|
4c73bd6e64
|
Fix compilation error in tools
|
2018-04-11 13:13:25 +02:00 |
|
Ai Lin Chia
|
6fcaa13fcd
|
Fix compilation error
|
2018-03-05 17:47:34 +01:00 |
|
Ai Lin Chia
|
956fb1b002
|
Fix sitehash search
|
2018-01-18 13:11:43 +01:00 |
|
Ai Lin Chia
|
461f14dc5a
|
Fix bug where host is not treated as hex
|
2018-01-18 11:42:48 +01:00 |
|
Ai Lin Chia
|
fabc0d5b72
|
More logs
|
2018-01-18 11:37:06 +01:00 |
|
Ai Lin Chia
|
ea65f2e1a9
|
Enable logs
|
2018-01-18 10:58:51 +01:00 |
|
Ai Lin Chia
|
80cb8cb956
|
Fix logging for verify_spiderdb
|
2018-01-16 16:38:03 +01:00 |
|
Ai Lin Chia
|
fb45f5e6ad
|
Add verify_spiderdb tool
|
2018-01-16 16:02:21 +01:00 |
|
Ai Lin Chia
|
0b7799524a
|
Modify dump_redirect to dump first_ip as well
|
2018-01-08 13:47:15 +01:00 |
|
Brian Rasmusson
|
30c26c2d5b
|
add word_variations to include dir
|
2017-12-22 15:37:27 +01:00 |
|
Brian Rasmusson
|
a90a6c536e
|
output hash values for links in get_titlerec tool
|
2017-12-16 11:57:36 +01:00 |
|
Ai Lin Chia
|
4f1ab01497
|
Add dump tool to dump redirected document
|
2017-12-13 23:02:08 +01:00 |
|
Ai Lin Chia
|
f4bcf46e85
|
More dump tools
|
2017-12-06 12:11:21 +01:00 |
|
Ai Lin Chia
|
1df27bd90a
|
Split noindex to check nofollow as well
|
2017-11-24 12:35:53 +01:00 |
|
Ai Lin Chia
|
cf22910ce5
|
Fix compilation error
|
2017-11-23 17:04:54 +01:00 |
|
Ai Lin Chia
|
d7c9104ce2
|
More criteria for dump_unwanted tool
|
2017-11-23 17:02:30 +01:00 |
|
Ai Lin Chia
|
92ca1b77d3
|
Add tools/dump_unwanted
|
2017-11-10 15:04:12 +01:00 |
|
Ai Lin Chia
|
c8d166f2f8
|
Check for javascript tags
|
2017-11-07 11:56:05 +01:00 |
|
Ai Lin Chia
|
326f25cbb4
|
Don't dump redirection pages/non txt/html pages
|
2017-11-07 11:46:24 +01:00 |
|
Ai Lin Chia
|
9be65ab54e
|
Fix g_hostdb.init call
|
2017-11-03 10:34:12 +01:00 |
|