0d46593f0c
Remove commented out code, unused functions
2016-08-05 13:53:55 +02:00
1f6305087e
Removed explicit m_buf[0] from InjectionRequest
2016-04-04 14:27:56 +02:00
ab0b9d03ea
Standardize header guards
2016-03-08 22:16:02 +01:00
72c2f6beba
Remove unused variable. Remove commented out code.
2016-02-27 16:22:01 +01:00
f10beb06e0
Removed InjectionRequest::ptr_diffbotReply
2016-01-28 13:59:44 +01:00
f8e9dac6de
Removed AutoBan, dmoz & related code (Categories and related Msg/Db), scraping code.
2015-12-11 21:56:05 +01:00
0b98f2c337
Remove some unused methods/class. Minor restructuring of test files.
2015-12-10 12:12:50 +01:00
f01db79e5f
show inject requests in the spider queue table now
2015-09-11 14:16:26 -06:00
36b8d384bd
Fixes to injector script.
New colors and metrics on performance graph.
2015-08-13 23:29:20 -06:00
32987e76ee
Add json metadata field to page inject.
Fix memory leak when spidering warc files.
Add script to inject warcs from internet archives search results.
2015-06-14 20:58:41 -06:00
08e01b5ac8
fix more bugs. new injections seem somewhat stable now.
2015-05-03 21:58:26 -07:00
bc54282339
complete overhaul of injection pipeline now compiles.
should distribute injection requests evenly over the cluster.
uses new InjectionRequest class which sets from httprequest
using parms in Parms.cpp. and easily serializes into a udp request.
very nice. we should use this model going forward.
2015-05-03 19:07:44 -07:00
b39a065259
checkpoint #2
2015-05-03 17:51:47 -07:00
0df4abc759
checkpoint
2015-05-04 00:17:17 +00:00
5c89bde956
now all container doc logic is in xmldoc
and out of pageinject. compiles. needs testing.
2015-05-01 20:32:54 -07:00
0ca27638bc
checkpoint. moved warc and arc looping into xmldoc.
now will pass any container doc from pageinject into
xmldoc. simplifies pageinject.cpp a lot. and sets up
a framework for dealing with container docs.
2015-05-01 19:11:13 -07:00
e387c0f154
yay test warc injecting working
2015-04-30 18:45:46 -07:00
2479dd330d
ok, move all the warc/arc parsing/indexing logic into
pageinject.cpp and out of xmldoc.cpp. it makes more
sense there. since really all we need to do is download
the warc's content and it is like injecting a delimited
document in the loop already in pageinject.cpp.
2015-04-29 21:39:18 -07:00
45c0909cb7
injecting warc files nicely now
2015-04-29 19:55:06 -07:00
96b8197ad3
now it compiles with -m32
2014-11-10 14:45:11 -08:00
e7dd8f7956
replace long long with int64_t
2014-10-30 13:36:39 -06:00
cb32766645
fix data import function some more. added qa test.
2014-09-24 12:40:39 -07:00
538f6103d5
get qa tests working again.
fixed facet links.
made data import function actually work so we can
import data from one collection (files) into another.
made url filters profile compatible with UFP_ stuff.
2014-09-23 17:48:40 -07:00
2ca303b7d7
new import code compiling. now needs runtime testing and
qa tests.
2014-09-20 20:12:28 -07:00
d4218e01d7
inject docs that come through our squid proxy
2014-07-09 12:25:23 -07:00
4e3e4fd0d0
yay! get multidoc flatfile injection working.
2014-06-15 14:57:38 -07:00
b2923acaf1
added support for using delimiter with injections so
one injected file can contain multiple documents.
2014-06-15 09:10:00 -07:00
7506d66d4a
fixes for page inject
2014-06-15 08:26:27 -07:00
108c281c33
fix annoying bug when adding new parms.
2014-06-10 12:29:50 -07:00
72c6d032d8
fix query reindex on subdocuments (diffbot json blurbs)
so that they just put in a spiderrequest to reindex
the parent url. Added &diffbotreply= to the injection
interface so dan can provide that along with the
pageUrl he passes in with &u=
2014-05-15 14:11:12 -07:00
4606e88721
code cleanups.
...
xmldoc::injectDoc(), and it'll
add a SpiderRequest as well.
better collectiondb init code.
2014-01-18 21:19:26 -08:00
f345c35927
crawlbot fixes.
2013-10-15 16:31:59 -07:00
313eb1209e
more crawlbot fixes
2013-10-15 12:22:59 -06:00
f974d6a47b
fixes for crawlbot universal api.
2013-09-16 10:49:37 -07:00
f6e560c1f4
Initial file population.
2013-08-02 13:12:24 -07:00