Commit Graph

35 Commits

SHA1 Message Date
0d46593f0c Remove commented out code, unused functions 2016-08-05 13:53:55 +02:00
1f6305087e Removed explicit m_buf[0] from InjectionRequest 2016-04-04 14:27:56 +02:00
ab0b9d03ea Standardize header guards 2016-03-08 22:16:02 +01:00
72c2f6beba Remove unused variable. Remove commented-out code. 2016-02-27 16:22:01 +01:00
f10beb06e0 Removed InjectionRequest::ptr_diffbotReply 2016-01-28 13:59:44 +01:00
f8e9dac6de Removed AutoBan, dmoz & related code (Categories and related Msg/Db), scraping code. 2015-12-11 21:56:05 +01:00
0b98f2c337 Remove some unused methods/class. Minor restructuring of test files. 2015-12-10 12:12:50 +01:00
f01db79e5f show inject requests in the spider queue table now 2015-09-11 14:16:26 -06:00
36b8d384bd Fixes to injector script. New colors and metrics on performance graph. 2015-08-13 23:29:20 -06:00
32987e76ee Add json metadata field to page inject. Fix memory leak when spidering warc files. Add script to inject warcs from internet archives search results. 2015-06-14 20:58:41 -06:00
08e01b5ac8 fix more bugs. new injections seem somewhat stable now. 2015-05-03 21:58:26 -07:00
bc54282339 complete overhaul of injection pipeline now compiles. should distribute injection requests evenly over the cluster. uses new InjectionRequest class which sets from httprequest using parms in Parms.cpp. and easily serializes into a udp request. very nice. we should use this model going forward. 2015-05-03 19:07:44 -07:00
b39a065259 checkpoint 2015-05-03 17:51:47 -07:00
0df4abc759 checkpoint 2015-05-04 00:17:17 +00:00
5c89bde956 now all container doc logic is in xmldoc and out of pageinject. compiles. needs testing. 2015-05-01 20:32:54 -07:00
0ca27638bc checkpoint. moved warc and arc looping into xmldoc. now will pass any container doc from pageinject into xmldoc. simplifies pageinject.cpp a lot. and sets up a framework for dealing with container docs. 2015-05-01 19:11:13 -07:00
e387c0f154 yay test warc injecting working 2015-04-30 18:45:46 -07:00
2479dd330d ok, move all the warc/arc parsing/indexing logic into pageinject.cpp and out of xmldoc.cpp. it makes more sense there. since really all we need to do is download the warc's content and it is like injecting a delimiterized document in the loop already in pageinject.cpp. 2015-04-29 21:39:18 -07:00
45c0909cb7 injecting warc files nicely now 2015-04-29 19:55:06 -07:00
96b8197ad3 now it compiles with -m32 2014-11-10 14:45:11 -08:00
e7dd8f7956 replace long long with int64_t 2014-10-30 13:36:39 -06:00
cb32766645 fix data import function some more. added qa test. 2014-09-24 12:40:39 -07:00
538f6103d5 get qa tests working again. fixed facet links. made data import function actually work so we can import data from one collection (files) into another. made url filters profile compatible with UFP_ stuff. 2014-09-23 17:48:40 -07:00
2ca303b7d7 new import code compiling. now needs runtime testing and qa tests. 2014-09-20 20:12:28 -07:00
d4218e01d7 inject docs that come through our squid proxy 2014-07-09 12:25:23 -07:00
4e3e4fd0d0 yay! get multidoc flatfile injection working. 2014-06-15 14:57:38 -07:00
b2923acaf1 added support for using delimiter with injections so one injected file can contain multiple documents. 2014-06-15 09:10:00 -07:00
7506d66d4a fixes for page inject 2014-06-15 08:26:27 -07:00
108c281c33 fix annoying bug when adding new parms. 2014-06-10 12:29:50 -07:00
72c6d032d8 fix query reindex on subdocuments (diffbot json blurbs) so that they just put in a spiderrequest to reindex the parent url. Added &diffbotreply= to the injection interface so dan can provide that along with the pageUrl he passes in with &u= 2014-05-15 14:11:12 -07:00
4606e88721 code cleanups. xmldoc::injectDoc(), and it'll add a SpiderRequest as well. better collectiondb init code. 2014-01-18 21:19:26 -08:00
f345c35927 crawlbot fixes. 2013-10-15 16:31:59 -07:00
313eb1209e more crawlbot fixes 2013-10-15 12:22:59 -06:00
f974d6a47b fixes for crawlbot universal api. 2013-09-16 10:49:37 -07:00
f6e560c1f4 Initial file population. 2013-08-02 13:12:24 -07:00