Commit Graph

99 Commits

Author SHA1 Message Date
4cfb934aaf Addressed format-overflow compiler warnings 2021-06-20 08:16:15 +00:00
6d7e14d2a4 Corrects compiler warning: C++11 requires a space between string literal and macro 2021-06-19 17:17:06 +00:00
b1ace63607 codespell: spelling corrections 2021-05-06 01:52:55 +10:00
fe448173d5 Merge branch 'ia' into testing 2015-11-09 11:14:00 -07:00
aeca57e9f4 Pass in the buffer size of an injection request so that if the content length header field is bigger than the actual buffer we won't index random memory. Fixes bug with truncated warc captures. 2015-10-28 00:38:08 -06:00
3e19d43aa5 fix core 2015-10-14 12:03:12 -06:00
a4901431be a couple little fixes to pass smokes 2015-10-14 11:53:05 -06:00
1708d0608c some fixes for detecting corrupted injection requests. seems to be very common. 2015-10-07 21:47:10 -06:00
eefbe95ce9 Merge branch 'diffbot-testing' into ia-zak 2015-09-21 10:13:29 -06:00
bcdecc63c6 expose "urlip" injection parm to provide ip of url being injected to save gigablast from an ip lookup if you want. 2015-09-16 09:43:15 -06:00
32d7f5cb97 better warc injection load balancing 2015-09-15 15:04:26 -06:00
e2c61c7a78 Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into ia-zak 2015-09-11 14:22:27 -06:00
f01db79e5f show inject requests in the spider queue table now 2015-09-11 14:16:26 -06:00
a270e163de Fix coring on udp timeout when clustering search results. Add ability to force update a list of items in warc injector. 2015-09-11 11:05:57 -06:00
e7f1c75855 Add logic to limit number of msg7s to 100 per hosts, then we drop the requests. 2015-09-03 22:17:16 -06:00
36b8d384bd Fixes to injector script. New colors and metrics on performance graph. 2015-08-13 23:29:20 -06:00
15eb7f659d Fix some malformed html on hosts page. Fix core when no collection record in injection request. Add a script to test disk speed. 2015-07-16 12:02:14 -06:00
46af0e1bce if url too long return the EURLTOOBIG error code. it prints 'Too many chars in url' as the official error msg. 2015-07-08 21:36:18 -06:00
815bd7ce0a quite a few bug fixes. 2015-07-02 17:42:05 -06:00
32987e76ee Add json metadata field to page inject. Fix memory leak when spidering warc files. Add script to inject warcs from internet archives search results. 2015-06-14 20:58:41 -06:00
ee5ffef834 fix core 2015-05-05 02:53:42 +00:00
08e01b5ac8 fix more bugs. new injections seem somewhat stable now. 2015-05-03 21:58:26 -07:00
ff969d92bb can inject a single doc now 2015-05-03 21:14:28 -07:00
bc54282339 complete overhaul of injection pipeline now compiles. should distribute injection requests evenly over the cluster. uses new InjectionRequest class which sets from httprequest using parms in Parms.cpp. and easily serializes into a udp request. very nice. we should use this model going forward. 2015-05-03 19:07:44 -07:00
b39a065259 checkpoint 2015-05-03 17:51:47 -07:00
0df4abc759 checkpoint 2015-05-04 00:17:17 +00:00
b0abe597e7 more fixes from qa test. 2015-05-02 14:34:07 -07:00
16b73a9bdd now we pass both injection tests in qa.cpp 2015-05-02 12:32:13 -07:00
ecb6d081d5 fix indexArc() 2015-05-01 23:24:40 -07:00
5c89bde956 now all container doc logic is in xmldoc and out of pageinject. compiles. needs testing. 2015-05-01 20:32:54 -07:00
0ca27638bc checkpoint. moved warc and arc looping into xmldoc. now will any container doc from pageinject into xmldoc. simplifies pageinject.cpp a lot. and sets up a framework for dealing with container docs. 2015-05-01 19:11:13 -07:00
ce030fcfb0 now .arc and .arc.gz injections work 2015-04-30 20:25:26 -07:00
b4d0c53904 fix single url injects 2015-04-30 19:09:07 -07:00
fbfdde5195 fix for old delimeterized injects. was coring in gb smokes. 2015-04-30 19:07:12 -07:00
e387c0f154 yay test warc injecting working 2015-04-30 18:45:46 -07:00
f1663402d9 compiles again now 2015-04-30 18:23:46 -07:00
2479dd330d ok, move all the warc/arc parsing/indexing logic into pageinject.cpp and out of xmldoc.cpp. it makes more sense there. since really all we need to do is download the warc's content and it is like injecting a delimeterized document in the loop already in pageinject.cpp. 2015-04-29 21:39:18 -07:00
45c0909cb7 injecting warc files nicely now 2015-04-29 19:55:06 -07:00
21948e15f6 more fixes 2015-04-28 23:30:14 -07:00
9370c8f52e more fixes 2015-04-28 23:20:16 -07:00
0eb415d408 added preliminary support for spidering .warc.gz and .arc.gz files 2015-04-27 21:41:22 -06:00
38caa517f2 add switches to disable injections or querying from the master controls, for all collections. 2015-03-04 10:49:37 -08:00
b89f071f7c quite a few bug fixes from adding the new query syntax qa test. 2014-12-11 18:24:28 -08:00
0460335861 more permission system updates 2014-12-08 09:49:17 -08:00
a7462ed1f4 fix injection stuff 2014-12-04 09:29:17 -07:00
96b8197ad3 now it compiles with -m32 2014-11-10 14:45:11 -08:00
e7dd8f7956 replace long long with int64_t 2014-10-30 13:36:39 -06:00
29f928a71e import fixes 2014-09-25 20:48:34 -07:00
d4182cf4ed fix importing function some 2014-09-25 20:33:42 -07:00
fce036868b only host should read the import data. 2014-09-25 07:55:30 -07:00