Commit Graph

  • c77453348f Merge branch 'master' into diffbot Matt Wells 2013-09-18 09:23:48 -07:00
  • d6815f2c9d if family filter enabled (&ff=1) then prepend "gbadult:0 |" to the query to restrict to non-adult pages. mwells 2013-09-18 00:11:55 -06:00
  • a0032e0eb7 added another log statement for when debugging the adult content detector. we err on the side of caution for the most part. mwells 2013-09-18 00:06:21 -06:00
  • 119a4c0c22 fix adult content detector mwells 2013-09-17 23:53:17 -06:00
  • 5ec3803312 fix core in hashing gbisadult:[0|1] term. mwells 2013-09-17 23:27:31 -06:00
  • 3005f904c7 index gbisadult:1 if adult content gbisadult:0 if not. Matt Wells 2013-09-17 22:05:47 -07:00
  • 10fcfb6987 minor updates Matt Wells 2013-09-17 17:32:49 -07:00
  • b8590d7df9 do not show json pages if searching pages. Matt Wells 2013-09-17 17:23:58 -07:00
  • 7fa4138d1c fix Next 10 link Matt Wells 2013-09-17 17:19:41 -07:00
  • 98caa3225a fix query prepend logic for json searches Matt Wells 2013-09-17 17:16:39 -07:00
  • 017a0febef fix api dropdown selection. Matt Wells 2013-09-17 16:38:56 -07:00
  • 5e3b727eb5 crawlbot api fixes. Matt Wells 2013-09-17 16:30:57 -07:00
  • b38d54cef9 save crawlinfo as binary so it's easier to not miss anything. Matt Wells 2013-09-17 16:07:59 -07:00
  • 2beff7f7d8 crawlbot api updates Matt Wells 2013-09-17 15:59:50 -07:00
  • e50da4d012 crawlbot api fixes Matt Wells 2013-09-17 15:47:44 -07:00
  • c16fe8601b more crawlbot api fixes Matt Wells 2013-09-17 15:32:28 -07:00
  • e7151e6cc6 fix bug with spiders not coming on. Matt Wells 2013-09-17 14:35:48 -07:00
  • c81f700bf0 get reset collection kinda working. Matt Wells 2013-09-17 14:13:44 -07:00
  • 4321f02e4e trying to get reset collection working Matt Wells 2013-09-17 12:21:09 -07:00
  • fff8b80969 get collection delete working Matt Wells 2013-09-17 11:27:31 -07:00
  • 63973cf9c0 get "add new collection" working. Matt Wells 2013-09-17 10:43:23 -07:00
  • 02bf6ab3cc new crawlbot api. not backwards compatible any more. Matt Wells 2013-09-17 10:25:54 -07:00
  • f34a7f44ab compiler flag fix for xmldoc.o mwells 2013-09-16 22:35:16 -06:00
  • afd1b3a9a2 added Diffbot.h mwells 2013-09-16 21:42:48 -06:00
  • fc692202ba fix integration of urls filters into crawlbot page Matt Wells 2013-09-16 16:27:48 -07:00
  • e7ed9254d4 formatting... Matt Wells 2013-09-16 15:33:45 -07:00
  • 1a780d1f4a pretty up a little Matt Wells 2013-09-16 15:18:55 -07:00
  • a034604cef clean up to remove g_conf.m_useDiffbot Matt Wells 2013-09-16 15:00:43 -07:00
  • cb9969ad22 fix token bug Matt Wells 2013-09-16 14:38:29 -07:00
  • 3dfba4de69 doc updates Matt Wells 2013-09-16 14:29:01 -07:00
  • 4c11265a98 more updates to crawlbot api Matt Wells 2013-09-16 13:59:11 -07:00
  • 676437c3c4 more universal api updates. Matt Wells 2013-09-16 11:42:04 -07:00
  • 04f7774543 lower spider crawl info stats threshold Matt Wells 2013-09-16 11:27:09 -07:00
  • df96f81e78 fix spidering and other things. Matt Wells 2013-09-16 11:22:07 -07:00
  • f974d6a47b fixes for crawlbot universal api. Matt Wells 2013-09-16 10:49:37 -07:00
  • a50898649b various fixes. Matt Wells 2013-09-16 10:16:49 -07:00
  • 9db501d91c resolve merge conflict for nullTerm() Matt Wells 2013-09-16 09:06:33 -07:00
  • 78a334198b Merge branch 'master' into diffbot Matt Wells 2013-09-16 09:05:37 -07:00
  • 3ac79de92e fix typo adurl -> addurl. Matt Wells 2013-09-16 08:11:06 -07:00
  • e6f87f5049 do not send email alerts to sysadmin@gigablast. Matt Wells 2013-09-16 08:10:18 -07:00
  • 5deda56ede minor documentation updates. Matt Wells 2013-09-15 22:16:14 -07:00
  • 3fdbae4b05 admin.html documentation update. Matt Wells 2013-09-15 22:05:01 -07:00
  • 68db2e6cc6 fix bug when checking the delete checkbox on the injection page. Matt Wells 2013-09-15 21:47:42 -07:00
  • 965e23f192 fix core from hashtablex::set() not getting enough buf space. now we force it to allocate a minimum of 32 slots to fix another bug where it was re-allocating immediately upon adding a key because growTable() is ALWAYS called if there are less than 20 slots! Matt Wells 2013-09-15 21:15:58 -07:00
  • 991e2f30f7 speed up whitelist hashtable like 20x using hashtable key magic. Matt Wells 2013-09-15 21:10:53 -07:00
  • 928dc36a03 get "&site=abc.com+xyz.com"... working to restrict search results to specified sites. tested a little. Matt Wells 2013-09-15 20:16:48 -07:00
  • 2211881e59 take apt-get install ssl stuff out of admin.html installation instructions since we supply the ssl headers now. mwells 2013-09-15 18:27:47 -06:00
  • 01c2a6d381 we already include our own 32-bit libssl.a and libcrypto.a so we can ensure stability. so we have to include the header files as well really. mwells 2013-09-15 18:25:49 -06:00
  • 107037c6a2 new &sites=xyz.com+abc.com+... functionality compiles ok. mwells 2013-09-15 18:14:32 -06:00
  • b684414e16 almost done adding support for whitelists. i.e. list of sites to restrict search results to, for instance. mwells 2013-09-15 15:15:56 -06:00
  • 7ecffec40f universal api updates Matt Wells 2013-09-13 18:10:03 -07:00
  • d982997b0c streamline crawl stats. Matt Wells 2013-09-13 17:34:39 -07:00
  • 93ce424d99 start working on the main gui for crawlbot which is /crawlbot Matt Wells 2013-09-13 16:22:07 -07:00
  • 6b330da240 cleanup warnings in log. Matt Wells 2013-09-13 14:37:35 -07:00
  • eb65b9265d call diffbot /api/analyze if classify is true or api = "all" now. it will return "type": in the json to indicate page type. basically, it classifies the page. Matt Wells 2013-09-13 14:13:56 -07:00
  • 19056fc3f2 show "processed" instead of "matched". other fixes for spider stats. add new crawl stats. attempts and successes. Matt Wells 2013-09-13 11:51:55 -07:00
  • e3e6551e23 fix diffbot bugs. Matt Wells 2013-09-13 11:34:40 -07:00
  • 7dd647c222 trying to fix nukeJSON code. Matt Wells 2013-09-13 10:38:34 -07:00
  • ef3990da98 now when re-indexing an existing xml doc we first call nukeJSONObjects() to delete any pages that it indexed from the json objects it had in its diffbot reply. in this way it can then re-add them if its new diffbot reply has them again. Matt Wells 2013-09-13 10:00:21 -07:00
  • a412c798bf Merge branch 'master' into diffbot Matt Wells 2013-09-13 09:24:28 -07:00
  • 5dc7bd2ab4 integrate diffbot from svn back into git. Matt Wells 2013-09-13 09:23:18 -07:00
  • e152205765 make depend update mwells 2013-09-09 02:37:47 -06:00
  • 1d63aa936c remove plotter.h includes causing compiler errors on some machines. Matt Wells 2013-09-09 01:25:00 -07:00
  • 76b390aea2 fix typo Matt Wells 2013-09-08 19:51:57 -07:00
  • d930a833cc try to fix compiler error related to bad delete function override. added "throw()" before the first "{" in the function body. mwells 2013-09-08 20:15:39 -06:00
  • 828345a4c7 fix compiler warning in types.h. mwells 2013-09-08 20:00:52 -06:00
  • e1968b2237 Merge branch 'master' of git@github.com:gigablast/open-source-search-engine Matt Wells 2013-09-08 18:43:00 -07:00
  • cecc655eac minor ifdef fix. Matt Wells 2013-09-08 18:42:44 -07:00
  • 657d669ec8 exclude events and seo functionality. most people want this for web search so it should be a non-issue. mwells 2013-09-08 17:07:42 -06:00
  • 34b6d3e74a fixed some cores. brought in fixes from old repo. mwells 2013-09-08 16:16:13 -06:00
  • dcf45dd69d dump out doledb to disk when it has more than 50,000 negative keys to avoid positive/negative key annihilations delays. mwells 2013-09-08 15:09:54 -06:00
  • 03706131fe documentation updates in Spider.h. Matt Wells 2013-09-08 13:42:02 -07:00
  • 54c9353dbd try to fix core from g_inSigHandler being set. it should never be set since we do not use real time signals any more mwells 2013-09-08 12:34:37 -06:00
  • 0581f86265 fix core from calling a gettime related function from a pthread when a signal handler from the main thread was in use and POSSIBLY in the same function when the signal went off. different threads should be able to access that function just fine i'd imagine. mwells 2013-09-06 15:39:53 -06:00
  • 7aa81abf91 use the "onsite" keyword in your url filters instead of this "only spider links from same host" switch to keep things simpler. mwells 2013-09-06 09:37:17 -06:00
  • c58df10155 fix major bug causing spiders not to work. Matt Wells 2013-09-04 11:01:24 -07:00
  • 91c4e768b1 more family filter fixes mwells 2013-09-01 18:28:49 -06:00
  • aaf333c46c try to get family filter (&ff=1) working again to filter out adult search results. mwells 2013-09-01 18:22:38 -06:00
  • afbd1e2b96 fix core from trying to get the time while in a sig handler. getTime() is not async safe. mwells 2013-09-01 12:55:22 -06:00
  • 93dfb0cfd4 fix for the "spiders stuck" fix. mwells 2013-08-31 11:25:26 -06:00
  • 5e0a53b909 minor print change mwells 2013-08-31 10:57:36 -06:00
  • af46945403 show more info when dumping doledb. mwells 2013-08-31 10:55:05 -06:00
  • 9696c7936a Merge branch 'master' into diffbot Matt Wells 2013-08-30 16:33:00 -07:00
  • 94e6492916 removed MAX_COLL_RECS so we can have unlimited collections, really limited by the sizeof(collnum_t) only now, which is 16bits, 15bits unsigned, which is the limitation. can always expand this so we can have more than 32k collections. Matt Wells 2013-08-30 16:20:38 -07:00
  • f6bcaeb76a minor fix. mwells 2013-08-30 00:16:30 -06:00
  • 900bbf8fba try to fix the bug of the spiders kinda getting stuck and not spidering to their max potential because of doledb record annihilations at the top of the spider priority queue in spiderdb of SpiderRequests. was causing lots of re-reads in Msg5.cpp of doledb, like over 300 rounds, very slow. mwells 2013-08-29 21:59:02 -06:00
  • 2e9c8f7c6e Merge branch 'master' of github.com:gigablast/open-source-search-engine mwells 2013-08-29 21:17:46 -06:00
  • 84fae9a3c6 Fix issue of reading spiderrequests from doledb at the very first key in spiderdb. causes lots of positive/negative key annihilations. we end up re-reading like 300 times in some cases just to get a url from a doledb priority. mwells 2013-08-29 21:16:59 -06:00
  • ca2a024d04 fixed up thread/spider log msgs. fixed core from calling fprintf in alarm signal missed quickpoll handler. mwells 2013-08-29 21:15:42 -06:00
  • e925012dce change a couple of possible reserved names in C++ to non-reserved names. #define _ADDRESS_H_ to _GB_ADDRESS_H_ etc. mwells 2013-08-28 22:59:01 -06:00
  • 82ee2dfed7 fix cores when spider is unzipping gzipped web pages. mwells 2013-08-28 22:49:22 -06:00
  • 80179525c1 when using pthreads block SIGIO so it does not silently kill the gb process because we no longer have a handler for it because it was bogging down the cpu because it went off every time a udp datagram was sent/received and it seemed to have a ton of overhead with it. SIGIO used to be sent when the signal queue was full so we'd resort to polling the file descriptors, so i'm not sure how this will affect us. also updated Threads.cpp to use getpidtid() instead of getpid() to get the thread id when using pthreads, not the process id. using pthreads is now default behaviour even though they suck. we used to use clone() but the newer stuff doesn't allow us to override errno_location anymore. mwells 2013-08-21 15:01:26 -06:00
  • 6332de2daf added link to compare.html comparison to SOLR into documentation. mwells 2013-08-21 13:14:17 -06:00
  • 37a6549a58 updates to developer.html developer documentation. removed a lot of obsolete information. still needs more work. mwells 2013-08-21 13:09:55 -06:00
  • 8971d9b932 comment out urldb from developer.html since no longer used. mwells 2013-08-21 08:59:51 -06:00
  • 6cf0497c2c added a little posdb documentation to developer.html. posdb replaced indexdb as the new index because it has word position info as well as word field info. mwells 2013-08-21 08:40:28 -06:00
  • a2a57addd9 try fixing the cpu being slammed in the sigiohandler. seems like signals meaning might have changed in the kernel, etc. over the years. fixed Loop.cpp. mwells 2013-08-20 14:12:44 -06:00
  • a270a9bc91 updated README.md to reference compare.html mwells 2013-08-19 17:20:30 -06:00
  • 7d3cc672c8 use ./gb blaster -u <fileofurls> to just inject urls, but use -i to also add the outlinks to spiderdb. mwells 2013-08-19 16:33:27 -06:00
  • 3550bf2d8a compare.html update. mwells 2013-08-19 16:21:01 -06:00