c77453348fMerge branch 'master' into diffbot
Matt Wells
2013-09-18 09:23:48 -07:00
d6815f2c9dif family filter enabled (&ff=1) then prepend "gbadult:0 |" to the query to restrict to non-adult pages.
mwells
2013-09-18 00:11:55 -06:00
a0032e0eb7added another log statement for debugging the adult content detector. we err on the side of caution for the most part.
mwells
2013-09-18 00:06:21 -06:00
fc692202bafix integration of url filters into crawlbot page
Matt Wells
2013-09-16 16:27:48 -07:00
e7ed9254d4formatting...
Matt Wells
2013-09-16 15:33:45 -07:00
1a780d1f4apretty up a little
Matt Wells
2013-09-16 15:18:55 -07:00
a034604cefclean up to remove g_conf.m_useDiffbot
Matt Wells
2013-09-16 15:00:43 -07:00
cb9969ad22fix token bug
Matt Wells
2013-09-16 14:38:29 -07:00
3dfba4de69doc updates
Matt Wells
2013-09-16 14:29:01 -07:00
4c11265a98more updates to crawlbot api
Matt Wells
2013-09-16 13:59:11 -07:00
676437c3c4more universal api updates.
Matt Wells
2013-09-16 11:42:04 -07:00
04f7774543lower spider crawl info stats threshold
Matt Wells
2013-09-16 11:27:09 -07:00
df96f81e78fix spidering and other things.
Matt Wells
2013-09-16 11:22:07 -07:00
f974d6a47bfixes for crawlbot universal api.
Matt Wells
2013-09-16 10:49:37 -07:00
a50898649bvarious fixes.
Matt Wells
2013-09-16 10:16:49 -07:00
9db501d91cresolve merge conflict for nullTerm()
Matt Wells
2013-09-16 09:06:33 -07:00
78a334198bMerge branch 'master' into diffbot
Matt Wells
2013-09-16 09:05:37 -07:00
3ac79de92efix typo adurl -> addurl.
Matt Wells
2013-09-16 08:11:06 -07:00
e6f87f5049do not send email alerts to sysadmin@gigablast.
Matt Wells
2013-09-16 08:10:18 -07:00
5deda56edeminor documentation updates.
Matt Wells
2013-09-15 22:16:14 -07:00
3fdbae4b05admin.html documentation update.
Matt Wells
2013-09-15 22:05:01 -07:00
68db2e6cc6fix bug when checking the delete checkbox on the injection page.
Matt Wells
2013-09-15 21:47:42 -07:00
965e23f192fix core from hashtablex::set() not getting enough buf space. now we force it to allocate a minimum of 32 slots to fix another bug where it was re-allocating immediately upon adding a key because growTable() is ALWAYS called if there are fewer than 20 slots!
Matt Wells
2013-09-15 21:15:58 -07:00
991e2f30f7speed up whitelist hashtable like 20x using hashtable key magic.
Matt Wells
2013-09-15 21:10:53 -07:00
928dc36a03get "&site=abc.com+xyz.com"... working to restrict search results to specified sites. tested a little.
Matt Wells
2013-09-15 20:16:48 -07:00
2211881e59take apt-get install ssl stuff out of admin.html installation instructions since we supply the ssl headers now.
mwells
2013-09-15 18:27:47 -06:00
01c2a6d381we already include our own 32-bit libssl.a and libcrypto.a so we can ensure stability. so we have to include the header files as well really.
mwells
2013-09-15 18:25:49 -06:00
107037c6a2new &sites=xyz.com+abc.com+... functionality compiles ok.
mwells
2013-09-15 18:14:32 -06:00
b684414e16almost done adding support for whitelists. i.e. list of sites to restrict search results to, for instance.
mwells
2013-09-15 15:15:56 -06:00
7ecffec40funiversal api updates
Matt Wells
2013-09-13 18:10:03 -07:00
d982997b0cstreamline crawl stats.
Matt Wells
2013-09-13 17:34:39 -07:00
93ce424d99start working on the main gui for crawlbot which is /crawlbot
Matt Wells
2013-09-13 16:22:07 -07:00
6b330da240cleanup warnings in log.
Matt Wells
2013-09-13 14:37:35 -07:00
eb65b9265dcall diffbot /api/analyze if classify is true or api = "all" now. it will return "type": in the json to indicate page type. basically, it classifies the page.
Matt Wells
2013-09-13 14:13:56 -07:00
19056fc3f2show "processed" instead of "matched". other fixes for spider stats. add new crawl stats. attempts and successes.
Matt Wells
2013-09-13 11:51:55 -07:00
e3e6551e23fix diffbot bugs.
Matt Wells
2013-09-13 11:34:40 -07:00
7dd647c222trying to fix nukeJSON code.
Matt Wells
2013-09-13 10:38:34 -07:00
ef3990da98now when re-indexing an existing xml doc we first call nukeJSONObjects() to delete any pages that it indexed from the json objects it had in its diffbot reply. in this way it can then re-add them if its new diffbot reply has them again.
Matt Wells
2013-09-13 10:00:21 -07:00
a412c798bfMerge branch 'master' into diffbot
Matt Wells
2013-09-13 09:24:28 -07:00
5dc7bd2ab4integrate diffbot from svn back into git.
Matt Wells
2013-09-13 09:23:18 -07:00
1d63aa936cremove plotter.h includes causing compiler errors on some machines.
Matt Wells
2013-09-09 01:25:00 -07:00
76b390aea2fix typo
Matt Wells
2013-09-08 19:51:57 -07:00
d930a833cctry to fix compiler error related to bad delete function override. added "throw()" before the first "{" in the function body.
mwells
2013-09-08 20:15:39 -06:00
828345a4c7fix compiler warning in types.h.
mwells
2013-09-08 20:00:52 -06:00
e1968b2237Merge branch 'master' of git@github.com:gigablast/open-source-search-engine
Matt Wells
2013-09-08 18:43:00 -07:00
cecc655eacminor ifdef fix.
Matt Wells
2013-09-08 18:42:44 -07:00
657d669ec8exclude events and seo functionality. most people want this for web search so it should be a non-issue.
mwells
2013-09-08 17:07:42 -06:00
34b6d3e74afixed some cores. brought in fixes from old repo.
mwells
2013-09-08 16:16:13 -06:00
dcf45dd69ddump out doledb to disk when it has more than 50,000 negative keys to avoid positive/negative key annihilation delays.
mwells
2013-09-08 15:09:54 -06:00
03706131fedocumentation updates in Spider.h.
Matt Wells
2013-09-08 13:42:02 -07:00
54c9353dbdtry to fix core from g_inSigHandler being set. it should never be set since we do not use real time signals any more
mwells
2013-09-08 12:34:37 -06:00
0581f86265fix core from calling a gettime related function from a pthread when a signal handler from the main thread was in use and POSSIBLY in the same function when the signal went off. different threads should be able to access that function just fine i'd imagine.
mwells
2013-09-06 15:39:53 -06:00
7aa81abf91use the "onsite" keyword in your url filters instead of this "only spider links from same host" switch to keep things simpler.
mwells
2013-09-06 09:37:17 -06:00
c58df10155fix major bug causing spiders not to work.
Matt Wells
2013-09-04 11:01:24 -07:00
91c4e768b1more family filter fixes
mwells
2013-09-01 18:28:49 -06:00
aaf333c46ctry to get family filter (&ff=1) working again to filter out adult search results.
mwells
2013-09-01 18:22:38 -06:00
afbd1e2b96fix core from trying to get the time while in a sig handler. getTime() is not async safe.
mwells
2013-09-01 12:55:22 -06:00
93dfb0cfd4fix for the "spiders stuck" fix.
mwells
2013-08-31 11:25:26 -06:00
af46945403show more info when dumping doledb.
mwells
2013-08-31 10:55:05 -06:00
9696c7936aMerge branch 'master' into diffbot
Matt Wells
2013-08-30 16:33:00 -07:00
94e6492916removed MAX_COLL_RECS so we can have unlimited collections, limited now only by sizeof(collnum_t), which is 16 bits signed (15 usable bits), i.e. about 32k collections. can always expand this so we can have more than 32k collections.
Matt Wells
2013-08-30 16:20:38 -07:00
900bbf8fbatry to fix the bug of the spiders kinda getting stuck and not spidering to their max potential because of doledb record annihilations at the top of the spider priority queue in spiderdb of SpiderRequests. was causing lots of re-reads in Msg5.cpp of doledb, like over 300 rounds, very slow.
mwells
2013-08-29 21:59:02 -06:00
2e9c8f7c6eMerge branch 'master' of github.com:gigablast/open-source-search-engine
mwells
2013-08-29 21:17:46 -06:00
84fae9a3c6Fix issue of reading spiderrequests from doledb at the very first key in spiderdb. causes lots of positive/negative key annihilations. we end up re-reading like 300 times in some cases just to get a url from a doledb priority.
mwells
2013-08-29 21:16:59 -06:00
ca2a024d04fixed up thread/spider log msgs. fixed core from calling fprintf in alarm signal missed quickpoll handler.
mwells
2013-08-29 21:15:42 -06:00
e925012dcechange a couple of possible reserved names in C++ to non-reserved names. #define _ADDRESS_H_ to _GB_ADDRESS_H_ etc.
mwells
2013-08-28 22:59:01 -06:00
82ee2dfed7fix cores when spider is unzipping gzipped web pages.
mwells
2013-08-28 22:49:22 -06:00
80179525c1when using pthreads, block SIGIO so it does not silently kill the gb process. we no longer have a handler for it because it was bogging down the cpu: it went off every time a udp datagram was sent/received and it seemed to have a ton of overhead with it. SIGIO used to be sent when the signal queue was full so we'd resort to polling the file descriptors, so i'm not sure how this will affect us. also updated Threads.cpp to use getpidtid() instead of getpid() to get the thread id when using pthreads, not the process id. using pthreads is now default behaviour even though they suck. we used to use clone() but the newer stuff doesn't allow us to override errno_location anymore.
mwells
2013-08-21 15:01:26 -06:00
6332de2dafadded link to compare.html comparison to SOLR into documentation.
mwells
2013-08-21 13:14:17 -06:00
37a6549a58updates to developer.html developer documentation. removed a lot of obsolete information. still needs more work.
mwells
2013-08-21 13:09:55 -06:00
8971d9b932comment out urldb from developer.html since it is no longer used.
mwells
2013-08-21 08:59:51 -06:00
6cf0497c2cadded a little posdb documentation to developer.html. posdb replaced indexdb as the new index because it has word position info as well as word field info.
mwells
2013-08-21 08:40:28 -06:00
a2a57addd9try fixing the cpu being slammed in the sigiohandler. seems like the meaning of the signals might have changed in the kernel, etc. over the years. fixed Loop.cpp.
mwells
2013-08-20 14:12:44 -06:00
a270a9bc91updated README.md to reference compare.html
mwells
2013-08-19 17:20:30 -06:00
7d3cc672c8use ./gb blaster -u <fileofurls> to just inject urls, but use -i to also add the outlinks to spiderdb.
mwells
2013-08-19 16:33:27 -06:00