Commit Graph

  • bc2b9d6179 bio.html updates Matt Wells 2014-02-17 15:29:08 -0700
  • f942183104 ignore maxtocrawl for bulk jobs too Matt Wells 2014-02-16 22:24:17 -0800
  • a4deb7ff08 exempt bulk jobs from maxtoprocess Matt Wells 2014-02-16 22:14:43 -0800
  • 9c9d5fff98 print out content type in caps with maroon bg in serps. use empty site patterns to mean no restriction, not "*" anymore for simplicity. Matt Wells 2014-02-16 22:47:02 -0700
  • 0b5cd6d3f9 more parm fixes Matt Wells 2014-02-16 22:18:39 -0700
  • 48315f6dc3 parm fixes Matt Wells 2014-02-16 22:13:27 -0700
  • 725b6189a7 show user's ip in master ips description so they can add it to the list easily. Matt Wells 2014-02-16 21:56:31 -0700
  • 9d0dca71db fix rapid coll delete bug some more. Matt Wells 2014-02-16 20:13:06 -0800
  • def7822a22 multiple red boxes for clarity Matt Wells 2014-02-16 20:33:51 -0700
  • ce652462b0 add color coded circles to coll nav bar. disk usage red box. Matt Wells 2014-02-16 19:59:53 -0700
  • 88a151f1d9 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2014-02-16 16:02:00 -0800
  • c691b2dd5f hopcount precedence fix Matt Wells 2014-02-16 16:01:29 -0800
  • a4b6716623 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2014-02-16 15:57:46 -0800
  • 29c9e49935 do not do link loop detection if doing a custom crawl. Matt Wells 2014-02-16 15:57:32 -0800
  • f8135e628e fall back to hop count if priority is tied (and both are due to be spidered). defaults back to breadth first like it was doing before. Matt Wells 2014-02-16 15:52:08 -0800
  • 0a4963f597 do not allow spot/seeds to be added to collnum being repaired or rebuilt. Matt Wells 2014-02-16 15:18:50 -0800
  • fe63371622 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2014-02-16 13:39:02 -0800
  • 4930243de3 minor updates Matt Wells 2014-02-16 13:38:54 -0800
  • fe0f2d3537 allow coll delete if not the one being repaired Matt Wells 2014-02-16 10:55:34 -0800
  • 32526a9b25 more checksum fixes for json. fixes for repair/rebuild procedure. Matt Wells 2014-02-16 10:46:41 -0800
  • df59d3946a fix content hash issues for json. do not hash url/resolved_url/html fields. do exact order-independent hashes of remaining field/value pairs. used for setting EDOCUNCHANGED and doing spidertime/querytime deduping. also do not index "html" json field because it is huge, slow and redundant. convert "date" field into a number so we can sort/constrain by article pub date. Matt Wells 2014-02-15 14:40:56 -0800
  • 734ce1fc55 fix core from a high priority injection inserting records at the same time as a lower priority spider. Matt Wells 2014-02-14 10:51:02 -0800
  • 3271f22995 Merge branch 'diffbot-testing' into diffbot Matt Wells 2014-02-13 11:25:59 -0800
  • dc8b9090e8 fix out of alloc slots core Matt Wells 2014-02-13 11:21:39 -0800
  • 08b103f3a4 Merge branch 'diffbot-testing' into diffbot Matt Wells 2014-02-13 10:11:56 -0800
  • c3d8a143be fix bug of process regex being ignored when crawl regex was specified. Matt Wells 2014-02-13 10:06:14 -0800
  • 4eee547391 do not do fuzzy deduping if &icc=1 (include cached copy) is true for search results. Matt Wells 2014-02-13 08:51:03 -0800
  • cd6069e5a6 send single space to socket if not streaming and search results still not ready after 10 seconds. send it every 10 seconds to prevent client from closing socket. sped up all downloads, json and csv, but not doing "fuzzy" deduping of search results, but just deduping on page content hash. added TcpSocket::m_numDestroys to ensure we do not send heartbeat on a socket that was closed and re-opened for another client. Matt Wells 2014-02-13 08:45:13 -0800
  • 5f0ebb4aef fix stack overflow Matt Wells 2014-02-13 00:01:49 -0800
  • 5db23c2eec fix infinite loop core thing. Matt Wells 2014-02-12 21:43:23 -0800
  • a9737ea97d Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2014-02-12 21:20:01 -0800
  • d42e2377e7 return json download as search results now. all smokes have passed. Matt Wells 2014-02-12 21:19:32 -0800
  • 8bb17de3c5 pass smoketest: TestOnlyProcessIfNew.testNotUpdatedContent so diffbot reply will not be updated in the index if it is unchanged thereby keeping lastCrawlTimeUTC the same. Matt Wells 2014-02-12 18:42:14 -0800
  • 25eae3da39 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2014-02-12 13:21:57 -0800
  • 0e48bbcea9 fix a core from bad return values Matt Wells 2014-02-12 13:21:30 -0800
  • ca4aafa8a6 added host disk usage redbox and stats. Matt Wells 2014-02-12 09:47:44 -0700
  • eb044c765c remove login link on root pages. add hand cursor to logout link. Matt Wells 2014-02-12 00:47:58 -0700
  • e5408d6596 minor fix Matt Wells 2014-02-12 00:37:54 -0700
  • 68a14de031 security admin fixes Matt Wells 2014-02-12 00:36:09 -0700
  • 3b0a571cea fix security system to actually work now Matt Wells 2014-02-12 00:06:00 -0700
  • 609a344a57 fix counting bug in array parms Matt Wells 2014-02-11 22:28:04 -0700
  • 9a76ff2531 minor parm updates Matt Wells 2014-02-11 20:50:36 -0700
  • 51d514f276 use supplied mime if supplied when injecting Matt Wells 2014-02-11 13:02:30 -0800
  • c9be18615c more parm saving fixes Matt Wells 2014-02-10 22:04:22 -0700
  • 2efbb602df fix saving parms bug Matt Wells 2014-02-10 21:52:29 -0700
  • 953b7c558d parm updates Matt Wells 2014-02-10 21:45:03 -0700
  • 69fa6662bc EDOCUNCHANGED fixes for diffbot Matt Wells 2014-02-10 16:23:39 -0800
  • 44a9e08d38 fix EDOCUNCHANGED logic. Matt Wells 2014-02-10 14:56:22 -0800
  • debd9089e8 better logging msg when updating parm. Matt Wells 2014-02-10 11:29:24 -0800
  • c041d47a0c html formatting updates Matt Wells 2014-02-10 00:15:04 -0700
  • b309d84245 html updates Matt Wells 2014-02-09 23:19:43 -0700
  • 9f0d2ad82e parm updates Matt Wells 2014-02-09 23:05:36 -0700
  • cdf2550136 more parm fixes Matt Wells 2014-02-09 22:51:16 -0700
  • c2c3fe993c parm fixes for basic pages Matt Wells 2014-02-09 22:25:08 -0700
  • d2b473e554 checkpoint Matt Wells 2014-02-09 19:09:44 -0700
  • 91ea5384a6 formatting changes Matt Wells 2014-02-09 16:57:39 -0700
  • ecdd167d9b code checkpoint Matt Wells 2014-02-09 16:41:43 -0700
  • f420bd2769 checkpoint Matt Wells 2014-02-09 15:09:48 -0700
  • c9ef525338 code checkpoint Matt Wells 2014-02-09 12:55:45 -0700
  • 6c9a44367f code checkpoint Matt Wells 2014-02-09 12:38:40 -0700
  • e60576c8eb another code checkpoint Matt Wells 2014-02-08 22:57:30 -0700
  • 156b50240a code checkpoint Matt Wells 2014-02-08 16:24:33 -0700
  • e593b6e1de basic controls code checkpoint. Matt Wells 2014-02-08 15:10:06 -0700
  • dabd691626 basic admin controls page structure Matt Wells 2014-02-08 00:34:45 -0700
  • fc47c18aec new printadmintop functionality. Matt Wells 2014-02-07 23:08:04 -0700
  • b634d06287 fix some cores. use olddoc contenthash for msg13 call for EDOCUNCHANGED errors. Matt Wells 2014-02-07 18:28:09 -0800
  • 252d24dc2a fix core of page spiders Matt Wells 2014-02-07 10:46:10 -0800
  • 573a04bccd fix bug in gbminint. Matt Wells 2014-02-06 21:36:47 -0800
  • edef3acf37 remove buggy line Matt Wells 2014-02-06 21:19:37 -0800
  • b3453c248e take out buggy statement. Matt Wells 2014-02-06 21:16:30 -0800
  • 7b42d2848d formatting fixes Matt Wells 2014-02-06 21:06:31 -0800
  • 2d4af1aefe index numbers as integers too, not just floats so we can sort by spider date without losing 128 seconds of resolution. Matt Wells 2014-02-06 20:57:54 -0800
  • 63e95c3b2d show lastSpidered time at end of json item. it's a float so we should probably store it as an int as well. we lose 128 seconds of resolution. Matt Wells 2014-02-06 18:56:38 -0800
  • 8d534b8ed8 many more fixes for streaming mode Matt Wells 2014-02-06 18:21:22 -0800
  • 874311ae52 fixes for streaming mode. Matt Wells 2014-02-06 16:28:42 -0800
  • 5787b15884 Merge branch 'diffbot' into diffbot-testing Matt Wells 2014-02-06 15:26:21 -0800
  • 8f6a4ee9b6 do not save collrecs all the time. stop superfluously setting m_needsSave. try to stop evaluating crawls that have completed because of lack of urls. we still need to fix it so if they change url filters so that more urls become available, that we retry! Matt Wells 2014-02-06 15:27:49 -0800
  • 845611ae1b &stream=1 stream mode fixes. Matt Wells 2014-02-06 15:23:53 -0800
  • 4cfe69a96f minor link updates Matt Wells 2014-02-06 14:41:33 -0800
  • f9dbd64056 get streaming time sliced results working Matt Wells 2014-02-06 14:25:44 -0800
  • 106077c163 fix spiderrequest deduping some more Matt Wells 2014-02-06 09:47:18 -0800
  • 4029b0b937 more faster spider fixes. tried to fix corrupt rdbcache. Matt Wells 2014-02-06 09:25:27 -0800
  • 9145d89e3f raise spiderdb minfilestomerge from 2 to 3 to reduce merging since we allow many urls in doledb for the same firstip now Matt Wells 2014-02-05 19:35:19 -0800
  • 203cdc5f99 delete from winnertable when deleting from winnertree Matt Wells 2014-02-05 19:12:33 -0800
  • 25e7ba5ef8 fix too many spiders out per ip some more Matt Wells 2014-02-05 17:11:45 -0800
  • 2842350e6d gb.conf spiders back on Matt Wells 2014-02-05 16:59:06 -0800
  • 5c8b9af1d3 fix rdbcache corruption from -O2 compile bug. fix too many spiders per ip bug! Matt Wells 2014-02-05 16:58:21 -0800
  • 951e9d5068 wait 180 secs for diffbot reply Matt Wells 2014-02-05 15:46:26 -0800
  • c60dcf4ecb show userobots for bulk jobs Matt Wells 2014-02-05 15:45:39 -0800
  • d9f0d57c0c core fixes. csv fixes. Matt Wells 2014-02-05 14:56:22 -0800
  • ecc10c2cb9 dup cache fixes. do not add dups to spiderdb either. Matt Wells 2014-02-05 14:09:35 -0800
  • 7806a8a68c fix excessive dupcache deduping. Matt Wells 2014-02-05 13:41:15 -0800
  • c159f80f05 MAX_WINNER_NODES back to 40. Matt Wells 2014-02-05 13:25:04 -0800
  • 9c26b85c2f fixed contenthash32 logic for json objects. fixed hashing of numbers/bools for json objects. added m_dupCache to reduce spiderrequests added to spiderdb. do not add urls to waitingtree if ufn is obviously filtered/banned. do not spider spiderrequest from doledb if maxoutperip would be violated. Matt Wells 2014-02-05 13:22:03 -0800
  • d86c7b8fbb do not store 40 urls in doledb if firstip does not have that many urls to begin with. it's better to just store one url in doledb for small domains. Matt Wells 2014-02-04 20:39:46 -0800
  • bda134268e added winnertable to avoid dups in winnertree. Matt Wells 2014-02-04 20:09:43 -0800
  • 053a9b9a0d spiders seem to be working somewhat now. Matt Wells 2014-02-04 18:23:37 -0800
  • 189999509b code checkpoint. time slicing, faster spider code compiling. now needs debug. Matt Wells 2014-02-04 17:34:43 -0800
  • 7f4d3205e5 streaming results code checkpoint. Matt Wells 2014-02-04 17:05:43 -0800
  • 3312400fee checkpoint for faster spider code. Matt Wells 2014-02-04 16:15:27 -0800
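
Note on the "128 seconds of resolution" mentioned in commits 63e95c3b2d and 2d4af1aefe: the figure is consistent with storing a Unix timestamp in a 32-bit IEEE 754 float. Such a float carries 24 significant bits, and 2014 timestamps fall between 2^30 and 2^31, so adjacent representable values are 2^(30-23) = 128 seconds apart. The sketch below illustrates only that arithmetic; it assumes the indexed float is single precision, and the helper to_float32 and the sample timestamp are hypothetical, not taken from the repository.

```python
import struct


def to_float32(value: float) -> float:
    """Round-trip a number through IEEE 754 single precision."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


# Hypothetical spider timestamp: seconds since the epoch, early February 2014.
last_spidered = 1391749074

# What survives if the timestamp is kept only as a 32-bit float.
stored = to_float32(last_spidered)

# Between 2**30 and 2**31 a float32 can only step in units of 2**(30 - 23).
spacing = 2 ** (30 - 23)

print("original:", last_spidered)
print("as float32:", int(stored))
print("spacing:", spacing, "seconds")
print("error:", abs(int(stored) - last_spidered), "seconds")
```

Keeping the value as an integer as well, as commit 2d4af1aefe describes, avoids this rounding, so sorting by spider date can distinguish documents spidered less than 128 seconds apart.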